Abstract: Due to the simplicity of their implementations, least
square support vector machine (LS-SVM) and proximal support
vector machine (PSVM) have been widely used in binary
classification applications. The conventional LS-SVM and PSVM
cannot be used in regression and multiclass classification
applications directly, although variants of LS-SVM and PSVM have
been proposed to handle such cases. This paper shows that both
LS-SVM and PSVM can be simplified further and that a unified
learning framework of LS-SVM, PSVM, and other regularization
algorithms, referred to as extreme learning machine (ELM), can be
built. ELM works for the “generalized” single-hidden-layer
feedforward networks (SLFNs), but the hidden layer (also called the
feature mapping) in ELM need not be tuned. Such SLFNs include but are
not limited to SVM, polynomial network, and the conventional
feedforward neural networks. This paper shows the following:
1) ELM provides a unified learning platform with a widespread
type of feature mappings and can be applied in regression and
multiclass classification applications directly; 2) from the
optimization method point of view, ELM has milder optimization
constraints compared to LS-SVM and PSVM; 3) in theory, compared
to ELM, LS-SVM and PSVM achieve suboptimal solutions and
require higher computational complexity; and 4) in theory, ELM
can approximate any target continuous function and classify any
disjoint regions. As verified by the simulation results, ELM tends
to have better scalability and achieve similar (for regression and
binary class cases) or much better (for multiclass cases)
generalization performance at much faster learning speed (up to
thousands of times) than traditional SVM and LS-SVM.

Index Terms —Extreme learning machine (ELM), feature 
mapping, kernel, least square support vector machine (LS-SVM), 
proximal support vector machine (PSVM), regularization 
network. 

I. Introduction 

IN THE PAST two decades, due to their surprising classification
capability, support vector machine (SVM) [1] and
its variants [2]-[4] have been extensively used in classification
applications. SVM has two main learning features: 1) In SVM,
the training data are first mapped into a higher dimensional
feature space through a nonlinear feature mapping function
$\phi(\mathbf{x})$, and 2) the standard optimization method is then used
to find the solution of maximizing the separating margin of
two different classes in this feature space while minimizing the
training errors. With the introduction of the epsilon-insensitive
loss function, the support vector method has been extended to
solve regression problems [5].

As the training of SVMs involves a quadratic programming
problem, the computational complexity of SVM training
algorithms is usually high: it is at least quadratic with
respect to the number of training examples. It is difficult to deal
with large problems using single traditional SVMs [6]; instead,
SVM mixtures can be used in large applications [6], [7].

Least square SVM (LS-SVM) [2] and proximal SVM 
(PSVM) [3] provide fast implementations of the traditional 
SVM. Both LS-SVM and PSVM use equality optimization 
constraints instead of inequalities from the traditional SVM, 
which results in a direct least square solution by avoiding 
quadratic programming. 

SVM, LS-SVM, and PSVM were originally proposed for binary
classification. Different methods have been proposed in order
for them to be applied in multiclass classification problems.
One-against-all (OAA) and one-against-one (OAO) methods
are mainly used in the implementation of SVM in multiclass
classification applications [8]. OAA-SVM consists of $m$ SVMs,
where $m$ is the number of classes. The $i$th SVM is trained
with all of the samples in the $i$th class with positive labels and
all the other examples from the remaining $m - 1$ classes with
negative labels. OAO-SVM consists of $m(m - 1)/2$ SVMs,
where each is trained with the samples from two classes only.
Some encoding schemes such as minimal output coding (MOC)
[9] and Bayesian coding-decoding schemes [10] have been
proposed to solve multiclass problems with LS-SVM. In MOC, each
class is represented by a unique binary output codeword of $m$ bits;
$m$ outputs are used in MOC-LS-SVM in order to scale up to $2^m$
classes [9]. Bayes’ rule-based LS-SVM uses $m$ binary LS-SVM
plug-in classifiers with its binary class probabilities inferred in
a second step within the related probabilistic framework [10].
With the prior multiclass probabilities and the posterior binary
class probabilities, Bayes’ rule is then applied $m$ times to infer
posterior multiclass probabilities [10]. Bayes’ rule and different
coding schemes are used in PSVM for multiclass problems [11].
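
To make the bookkeeping of these decompositions concrete, the following is a minimal sketch in Python. The function names are hypothetical and chosen only for illustration; they are not taken from [8] or [9]. It counts the binary classifiers needed by OAA and OAO, computes the smallest number of MOC outputs for a given number of classes, and relabels a target list for one OAA classifier.

```python
def oaa_count(m: int) -> int:
    """One-against-all: one binary classifier per class."""
    return m


def oao_count(m: int) -> int:
    """One-against-one: one binary classifier per unordered pair of classes."""
    return m * (m - 1) // 2


def moc_outputs(num_classes: int) -> int:
    """Minimal output coding: smallest m such that 2**m >= num_classes."""
    return max(1, (num_classes - 1).bit_length())


def oaa_labels(labels, positive_class):
    """Relabel multiclass targets for the OAA classifier of one class:
    +1 for the chosen class, -1 for all remaining classes."""
    return [1 if y == positive_class else -1 for y in labels]


if __name__ == "__main__":
    for m in (3, 10, 26):
        print(m, oaa_count(m), oao_count(m), moc_outputs(m))
```

The quadratic growth of the OAO count is the reason the number of classes matters for the training cost of these decompositions.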

The decision functions of binary SVM, LS-SVM, and PSVM
classifiers have the same form

$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i t_i K(\mathbf{x}, \mathbf{x}_i) + b\right) \tag{1}$$

where $t_i$ is the corresponding target class label of the training
data $\mathbf{x}_i$, $\alpha_i$ is the Lagrange multiplier to be computed by the
learning machines, and $K(\mathbf{u}, \mathbf{v})$ is a suitable kernel function to
be given by users. From the network architecture point of view,
SVM, LS-SVM, and PSVM can be considered as a specific
type of single-hidden-layer feedforward network (SLFN) (the
so-called support vector network termed by Cortes and Vapnik
[1]) where the output of the $i$th hidden node is $K(\mathbf{x}, \mathbf{x}_i)$ and the
output weight linking the $i$th hidden node to the output node
is $\alpha_i t_i$. The term bias $b$ plays an important role in SVM,
LS-SVM, and PSVM, which produces the equality optimization
constraints in the dual optimization problems of these methods.
For example, the only difference between LS-SVM and
PSVM is on how to use the bias $b$ in the optimization formula
while they have the same optimization constraint, resulting
in different least square solutions. No learning parameter in
the hidden-layer output function (kernel) $K(\mathbf{u}, \mathbf{v})$ needs to be
tuned by SVM, LS-SVM, and PSVM, although some
user-specified parameter needs to be chosen a priori.
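
The shared decision function (1) can be sketched in a few lines of Python. This is an illustrative sketch only: the Gaussian kernel, the parameter `gamma`, and the hand-picked values of `alpha` and `b` are assumptions made for the example and are not the result of any training procedure described here.

```python
import math


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel, one common user-supplied choice for K(u, v)."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * sq_dist)


def decision(x, train_x, train_t, alpha, b, kernel=rbf_kernel):
    """Evaluate f(x) = sign(sum_i alpha_i * t_i * K(x, x_i) + b), as in (1).

    Each training sample acts as one hidden node with output K(x, x_i)
    and output weight alpha_i * t_i.
    """
    s = sum(a * t * kernel(x, xi)
            for a, t, xi in zip(alpha, train_t, train_x)) + b
    return 1 if s >= 0 else -1


if __name__ == "__main__":
    # Illustrative values only: two training points, unit multipliers, zero bias.
    train_x = [[0.0, 0.0], [1.0, 1.0]]
    train_t = [-1, 1]
    alpha = [1.0, 1.0]
    print(decision([0.9, 0.9], train_x, train_t, alpha, b=0.0))
    print(decision([0.1, 0.1], train_x, train_t, alpha, b=0.0))
```

SVM, LS-SVM, and PSVM differ only in how `alpha` and `b` are obtained; the evaluation step is the same for all three.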

Extreme learning machine (ELM) [12]-[16] studies a much
wider type of “generalized” SLFNs whose hidden layer need
not be tuned. ELM has been attracting the attention of more
and more researchers [17]-[22]. ELM was originally developed
for the single-hidden-layer feedforward neural networks
[12]-[14] and then extended to the “generalized” SLFNs which may
not be neuron alike [15], [16]

$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta} \tag{2}$$

where $\mathbf{h}(\mathbf{x})$ is the hidden-layer output corresponding to the
input sample $\mathbf{x}$ and $\boldsymbol{\beta}$ is the output weight vector between the
hidden layer and the output layer. One of the salient features
of ELM is that the hidden layer need not be tuned. Essentially,
ELM originally proposes to apply random computational nodes
in the hidden layer, which are independent of the training
data. Different from traditional learning algorithms for a neural
type of SLFNs [23], ELM aims to reach not only the smallest
training error but also the smallest norm of output weights.
ELM [12], [13] and its variants [14]-[16], [24]-[28] mainly
focus on the regression applications. The latest development of
ELM has shown some relationships between ELM and SVM
[18], [19], [29].

Suykens and Vandewalle [30] described a training method for
SLFNs which applies the hidden-layer output mapping as the
feature mapping of SVM. However, the hidden-layer parameters
need to be iteratively computed by solving an optimization
problem (refer to the last paragraph in Section IV-A1 for
details). As Suykens and Vandewalle stated in their work (see
[30, p. 907]), the drawbacks of this method are the following:
the high computational cost and larger number of parameters in
the hidden layer. Liu et al. [18] and Frénay and Verleysen [19]
show that the ELM learning approach can be applied to SVMs
directly by simply replacing SVM kernels with (random) ELM
kernels and better generalization can be achieved. Different
from the study of Suykens and Vandewalle [30] in which the
hidden layer is parametric, the ELM hidden layer used in the
studies of Liu et al. [18] and Frénay and Verleysen [19] is
nonparametric, and the hidden-layer parameters need not be
tuned and can be fixed once randomly generated. Liu et al. [18]
suggest applying the ELM kernel in SVMs and particularly study
PSVM [3] with ELM kernel. Later, Frénay and Verleysen
[19] show that the normalized ELM kernel can also be applied
in the traditional SVM. Their proposed SVM with ELM
kernel and the conventional SVM have the same optimization
constraints (e.g., both inequality constraints and bias $b$ are
used). Recently, Huang et al. [29] further show the following:
1) SVM’s maximal separating margin property and the ELM’s
minimal norm of output weight property are actually consistent,
and with ELM framework, SVM’s maximal separating margin
property and Bartlett’s theory on feedforward neural networks
remain consistent, and 2) compared to SVM, ELM requires
fewer optimization constraints and results in simpler
implementation, faster learning, and better generalization performance.
However, similar to SVM, inequality optimization constraints
are used in [29]. Huang et al. [29] use random kernels and
discard the term bias $b$ used in the conventional SVM. However,
no direct relationship has so far been found between the original
ELM implementation [12]-[16] and LS-SVM/PSVM. Whether
feedforward neural networks, SVM, LS-SVM, and PSVM can be
unified still remains open.

Different from the studies of Huang et al. [29], Liu et al. 
[18], and Frénay and Verleysen [19], this paper extends ELM
to LS-SVM and PSVM and provides a unified solution for 
LS-SVM and PSVM under equality constraints. In particular, 
the following contributions have been made in this paper. 

1) ELM was originally developed from feedforward neural
networks [12]-[16]. Different from other ELM work in the
literature, this paper manages to extend ELM to kernel
learning: It is shown that ELM can use a wide type of
feature mappings (hidden-layer output functions), including
random hidden nodes and kernels. With this extension,
the unified ELM solution can be obtained for feedforward
neural networks, RBF network, LS-SVM, and PSVM.

2) Furthermore, ELM, which has higher scalability and
lower computational complexity, not only unifies different
popular learning algorithms but also provides a unified
solution to different practical applications (e.g.,
regression, binary, and multiclass classifications). Different
variants of LS-SVM and SVM are required for different
types of applications. ELM avoids such trivial and tedious
situations faced by LS-SVM and SVM. In the ELM method,
all these applications can be resolved in one formula.

3) From the optimization method point of view, ELM and
LS-SVM have the same optimization cost function;
however, ELM has milder optimization constraints compared
to LS-SVM and PSVM. As analyzed in this paper and
further verified by simulation results over 36 data sets of a
wide variety of types, compared to ELM, LS-SVM achieves
suboptimal solutions (when the same kernels are used) and has
higher computational complexity. As verified by
simulations, the resultant ELM method can run much faster than
LS-SVM. ELM with random hidden nodes can run even up
to tens of thousands of times faster than SVM and LS-SVM.
Different from earlier ELM works which do not perform
well in sparse data sets, the ELM method proposed in this
paper can handle sparse data sets well.

4) This paper also shows that the proposed ELM method not
only has universal approximation capability (of
approximating any target continuous function) but also has
classification capability (of classifying any disjoint regions).

II. Brief of SVMs

This section briefly reviews the conventional SVM [1] and its
variants, namely, LS-SVM [2] and PSVM [3].

A. SVM 


Cortes and Vapnik [1] studied the relationship between SVM
and multilayer feedforward neural networks and showed that
SVM can be seen as a specific type of SLFNs, the so-called
support vector networks. In 1962, Rosenblatt [31] suggested
that multilayer feedforward neural networks (perceptrons) can
be trained in a feature space $Z$ of the last hidden layer. In this
feature space, a linear decision function is constructed

$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{i} \alpha_i z_i(\mathbf{x})\right) \tag{3}$$

where $z_i(\mathbf{x})$ is the output of the $i$th neuron in the last hidden
layer of a perceptron. In order to find an alternative solution
of $z_i(\mathbf{x})$, in 1995, Cortes and Vapnik [1] proposed the SVM
which maps the data from the input space to some feature
space $Z$ through some nonlinear mapping $\phi(\mathbf{x})$ chosen a priori.
Constrained-optimization methods are then used to find the
separating hyperplane which maximizes the separating margins
of two different classes in the feature space.

Given a set of training data $(\mathbf{x}_i, t_i)$, $i = 1, \ldots, N$, where
$\mathbf{x}_i \in \mathbf{R}^d$ and $t_i \in \{-1, 1\}$, due to the nonlinear separability of
these training data in the input space, in most cases, one can
map the training data $\mathbf{x}_i$ from the input space to a feature space
$Z$ through a nonlinear mapping $\phi : \mathbf{x}_i \to \phi(\mathbf{x}_i)$. The distance
between two different classes in the feature space $Z$ is $2/\|\mathbf{w}\|$.
To maximize the separating margin and to minimize the training
errors $\xi_i$ is equivalent to

$$\begin{aligned}
\text{Minimize: } & L_{P_{\mathrm{SVM}}} = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i \\
\text{Subject to: } & t_i\left(\mathbf{w} \cdot \phi(\mathbf{x}_i) + b\right) \ge 1 - \xi_i, \quad i = 1, \ldots, N \\
& \xi_i \ge 0, \quad i = 1, \ldots, N
\end{aligned} \tag{4}$$

where $C$ is a user-specified parameter and provides a tradeoff
between the distance of the separating margin and the training
error.

Based on the Karush-Kuhn-Tucker (KKT) theorem [32], to
train such an SVM is equivalent to solving the following dual
optimization problem:

$$\begin{aligned}
\text{minimize: } & L_{D_{\mathrm{SVM}}} = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \alpha_i \alpha_j \, \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j) - \sum_{i=1}^{N} \alpha_i \\
\text{subject to: } & \sum_{i=1}^{N} t_i \alpha_i = 0 \\
& 0 \le \alpha_i \le C, \quad i = 1, \ldots, N
\end{aligned} \tag{5}$$

where each Lagrange multiplier $\alpha_i$ corresponds to a training
sample $(\mathbf{x}_i, t_i)$. Vectors $\mathbf{x}_i$’s for which $t_i(\mathbf{w} \cdot \phi(\mathbf{x}_i) + b) = 1$
are termed support vectors [1].

Kernel functions $K(\mathbf{u}, \mathbf{v}) = \phi(\mathbf{u}) \cdot \phi(\mathbf{v})$ are usually used in
the implementation of SVM learning algorithm. In this case,
we have

$$\begin{aligned}
\text{minimize: } & L_{D_{\mathrm{SVM}}} = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) - \sum_{i=1}^{N} \alpha_i \\
\text{subject to: } & \sum_{i=1}^{N} t_i \alpha_i = 0 \\
& 0 \le \alpha_i \le C, \quad i = 1, \ldots, N.
\end{aligned} \tag{6}$$

The SVM kernel function $K(\mathbf{u}, \mathbf{v})$ needs to satisfy Mercer’s
condition [1]. The decision function of SVM is

$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{s=1}^{N_s} \alpha_s t_s K(\mathbf{x}, \mathbf{x}_s) + b\right) \tag{7}$$

where $N_s$ is the number of support vectors $\mathbf{x}_s$’s.

B. LS-SVM

Suykens and Vandewalle [2] propose a least square version of the
SVM classifier. Instead of the inequality constraint (4) adopted
in SVM, equality constraints are used in the LS-SVM [2].
Hence, by solving a set of linear equations instead of quadratic
programming, one can implement the least square approach
easily. LS-SVM is proven to have excellent generalization
performance and low computational cost in many applications.
In LS-SVM, the classification problem is formulated as

$$\begin{aligned}
\text{Minimize: } & L_{P_{\mathrm{LS\text{-}SVM}}} = \frac{1}{2}\mathbf{w} \cdot \mathbf{w} + C\,\frac{1}{2}\sum_{i=1}^{N} \xi_i^2 \\
\text{Subject to: } & t_i\left(\mathbf{w} \cdot \phi(\mathbf{x}_i) + b\right) = 1 - \xi_i, \quad i = 1, \ldots, N.
\end{aligned} \tag{8}$$

Based on the KKT theorem, to train such an LS-SVM is
equivalent to solving the following dual optimization problem:

$$L_{D_{\mathrm{LS\text{-}SVM}}} = \frac{1}{2}\mathbf{w} \cdot \mathbf{w} + C\,\frac{1}{2}\sum_{i=1}^{N} \xi_i^2 - \sum_{i=1}^{N} \alpha_i\left(t_i\left(\mathbf{w} \cdot \phi(\mathbf{x}_i) + b\right) - 1 + \xi_i\right). \tag{9}$$

Different from Lagrange multipliers (5) in SVM, in LS-SVM,
Lagrange multipliers $\alpha_i$’s can be either positive or
negative due to the equality constraints used. Based on the
KKT theorem, we can have the optimality conditions of (9) as
follows:

$$\frac{\partial L_{D_{\mathrm{LS\text{-}SVM}}}}{\partial \mathbf{w}} = 0 \rightarrow \mathbf{w} = \sum_{i=1}^{N} \alpha_i t_i \phi(\mathbf{x}_i) \tag{10a}$$
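$$\frac{\partial L_{D_{\mathrm{LS\text{-}SVM}}}}{\partial b} = 0 \rightarrow \sum_{i=1}^{N} \alpha_i t_i = 0 \tag{10b}$$

$$\frac{\partial L_{D_{\mathrm{LS\text{-}SVM}}}}{\partial \xi_i} = 0 \rightarrow \alpha_i = C\xi_i, \quad i = 1, \ldots, N \tag{10c}$$

$$\frac{\partial L_{D_{\mathrm{LS\text{-}SVM}}}}{\partial \alpha_i} = 0 \rightarrow t_i\left(\mathbf{w} \cdot \phi(\mathbf{x}_i) + b\right) - 1 + \xi_i = 0, \quad i = 1, \ldots, N. \tag{10d}$$
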


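By substituting (10a)-(10c) into (10d), the aforementioned
equations can be equivalently written as

$$\begin{bmatrix} 0 & \mathbf{T}^T \\ \mathbf{T} & \frac{\mathbf{I}}{C} + \boldsymbol{\Omega}_{\mathrm{LS\text{-}SVM}} \end{bmatrix} \begin{bmatrix} b \\ \boldsymbol{\alpha} \end{bmatrix} = \begin{bmatrix} 0 & \mathbf{T}^T \\ \mathbf{T} & \frac{\mathbf{I}}{C} + \mathbf{Z}\mathbf{Z}^T \end{bmatrix} \begin{bmatrix} b \\ \boldsymbol{\alpha} \end{bmatrix} = \begin{bmatrix} 0 \\ \vec{\mathbf{1}} \end{bmatrix} \tag{11}$$

where

$$\mathbf{Z} = \begin{bmatrix} t_1 \phi(\mathbf{x}_1) \\ \vdots \\ t_N \phi(\mathbf{x}_N) \end{bmatrix}, \qquad \boldsymbol{\Omega}_{\mathrm{LS\text{-}SVM}} = \mathbf{Z}\mathbf{Z}^T. \tag{12}$$

The feature mapping $\phi(\mathbf{x})$ is a row vector,^1 $\mathbf{T} = [t_1, t_2, \ldots, t_N]^T$,
$\boldsymbol{\alpha} = [\alpha_1, \alpha_2, \ldots, \alpha_N]^T$, and $\vec{\mathbf{1}} = [1, 1, \ldots, 1]^T$. In
LS-SVM, as $\phi(\mathbf{x})$ is usually unknown, Mercer’s condition [33]
can be applied to matrix $\boldsymbol{\Omega}_{\mathrm{LS\text{-}SVM}}$

$$\Omega_{\mathrm{LS\text{-}SVM}\,i,j} = t_i t_j \, \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j) = t_i t_j K(\mathbf{x}_i, \mathbf{x}_j). \tag{13}$$

The decision function of LS-SVM classifier is
$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i t_i K(\mathbf{x}, \mathbf{x}_i) + b\right)$.

The Lagrange multipliers $\alpha_i$’s are proportional to the training
errors $\xi_i$’s in LS-SVM, while in the conventional SVM, many
Lagrange multipliers $\alpha_i$’s are typically equal to zero. Compared
to the conventional SVM, sparsity is lost in LS-SVM [9]; this
is also true of PSVM [3].

C. PSVM

Fung and Mangasarian [3] propose the PSVM classifier,
which classifies data points depending on proximity to either
one of the two separation planes that are aimed to be pushed
away as far apart as possible. Similar to LS-SVM, the key idea
of PSVM is that the separation hyperplanes are not bounded
planes anymore but “proximal” planes, and such effect is
reflected in mathematical expressions that the inequality
constraints are changed to equality constraints. Different from
LS-SVM, in the objective formula of linear PSVM, $(\mathbf{w} \cdot \mathbf{w} + b^2)$
is used instead of $\mathbf{w} \cdot \mathbf{w}$, which makes the optimization problem
strongly convex and has little or no effect on the original
optimization problem.

The mathematical model built for linear PSVM is

$$\begin{aligned}
\text{Minimize: } & L_{P_{\mathrm{PSVM}}} = \frac{1}{2}\left(\mathbf{w} \cdot \mathbf{w} + b^2\right) + C\,\frac{1}{2}\sum_{i=1}^{N} \xi_i^2 \\
\text{Subject to: } & t_i\left(\mathbf{w} \cdot \mathbf{x}_i + b\right) = 1 - \xi_i, \quad i = 1, \ldots, N.
\end{aligned} \tag{14}$$

The corresponding dual optimization problem is

$$L_{D_{\mathrm{PSVM}}} = \frac{1}{2}\left(\mathbf{w} \cdot \mathbf{w} + b^2\right) + C\,\frac{1}{2}\sum_{i=1}^{N} \xi_i^2 - \sum_{i=1}^{N} \alpha_i\left(t_i\left(\mathbf{w} \cdot \mathbf{x}_i + b\right) - 1 + \xi_i\right). \tag{15}$$

By applying KKT optimality conditions [similar to
(10a)-(10d)], we have

$$\left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}_{\mathrm{PSVM}} + \mathbf{T}\mathbf{T}^T\right)\boldsymbol{\alpha} = \left(\frac{\mathbf{I}}{C} + \mathbf{Z}\mathbf{Z}^T + \mathbf{T}\mathbf{T}^T\right)\boldsymbol{\alpha} = \vec{\mathbf{1}} \tag{16}$$

where $\mathbf{Z} = [t_1\mathbf{x}_1, \ldots, t_N\mathbf{x}_N]^T$ and $\boldsymbol{\Omega}_{\mathrm{PSVM}} = \mathbf{Z}\mathbf{Z}^T$.

Similar to LS-SVM, the training data $\mathbf{x}$ can be mapped
from the input space into a feature space $\phi : \mathbf{x} \to \phi(\mathbf{x})$,
and one can obtain the nonlinear version of PSVM:
$\mathbf{Z} = [t_1\phi(\mathbf{x}_1)^T, \ldots, t_N\phi(\mathbf{x}_N)^T]^T$. As feature mapping $\phi(\mathbf{x})$ is
usually unknown, Mercer’s conditions can be applied to
matrix $\boldsymbol{\Omega}_{\mathrm{PSVM}}$: $\Omega_{\mathrm{PSVM}\,i,j} = t_i t_j \, \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j) = t_i t_j K(\mathbf{x}_i, \mathbf{x}_j)$,
which is the same as LS-SVM’s kernel matrix $\boldsymbol{\Omega}_{\mathrm{LS\text{-}SVM}}$ (13).
The decision function of PSVM classifier is
$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i t_i K(\mathbf{x}, \mathbf{x}_i) + b\right)$.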
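
As an illustration of how the LS-SVM linear system (11) replaces quadratic programming, the following Python sketch builds the system with the kernel matrix (13) and solves it by Gaussian elimination. It is a small sketch under stated assumptions, not the implementation of [2]: the helper names `solve`, `linear_kernel`, `train_ls_svm`, and `predict`, the linear kernel, the toy data, and the value of `C` are all chosen here for illustration.

```python
def solve(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z


def linear_kernel(u, v):
    """K(u, v) = u . v, used here only to keep the example small."""
    return sum(a * b for a, b in zip(u, v))


def train_ls_svm(X, t, C, kernel=linear_kernel):
    """Build and solve system (11): [[0, T^T], [T, I/C + Omega]] [b; alpha] = [0; 1]."""
    n = len(X)
    A = [[0.0] + [float(ti) for ti in t]]            # first row: [0, T^T]
    for i in range(n):
        row = [float(t[i])]                          # first column: T
        for j in range(n):
            omega_ij = t[i] * t[j] * kernel(X[i], X[j])   # entry (13)
            row.append(omega_ij + (1.0 / C if i == j else 0.0))
        A.append(row)
    z = solve(A, [0.0] + [1.0] * n)
    return z[0], z[1:]                               # bias b, multipliers alpha


def predict(x, X, t, alpha, b, kernel=linear_kernel):
    """Decision function sign(sum_i alpha_i t_i K(x, x_i) + b)."""
    s = sum(a * ti * kernel(x, xi) for a, ti, xi in zip(alpha, t, X)) + b
    return 1 if s >= 0 else -1


if __name__ == "__main__":
    X = [[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 2.0]]
    t = [-1, -1, 1, 1]
    b, alpha = train_ls_svm(X, t, C=10.0)
    print([predict(x, X, t, alpha, b) for x in X])
```

The first row of the system enforces condition (10b); every training sample receives a generally nonzero multiplier, which is the loss of sparsity noted in the text.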


^1 In order to keep the notation and formula formats consistent, similar to
LS-SVM [2], PSVM [3], ELM [29], and TER-ELM [22], feature mappings $\phi(\mathbf{x})$
and $\mathbf{h}(\mathbf{x})$ are defined as row vectors while the rest of the vectors are defined
as column vectors in this paper unless explicitly specified.


III. Proposed Constrained-Optimization-Based 
ELM 

ELM [12]-[14] was originally proposed for the single-hidden-layer
feedforward neural networks and was then extended
to the generalized SLFNs where the hidden layer need
not be neuron alike [15], [16]. In ELM, the hidden layer need
not be tuned. The output function of ELM for generalized
SLFNs (take one output node case as an example) is

$$f_L(\mathbf{x}) = \sum_{i=1}^{L} \beta_i h_i(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta} \tag{17}$$

where $\boldsymbol{\beta} = [\beta_1, \ldots, \beta_L]^T$ is the vector of the output weights
between the hidden layer of $L$ nodes and the output node and
$\mathbf{h}(\mathbf{x}) = [h_1(\mathbf{x}), \ldots, h_L(\mathbf{x})]$ is the output (row) vector of the
hidden layer with respect to the input $\mathbf{x}$. $\mathbf{h}(\mathbf{x})$ actually maps the
data from the $d$-dimensional input space to the $L$-dimensional
hidden-layer feature space (ELM feature space) $H$, and thus,
$\mathbf{h}(\mathbf{x})$ is indeed a feature mapping. For the binary classification
applications, the decision function of ELM is

$$f_L(\mathbf{x}) = \operatorname{sign}\left(\mathbf{h}(\mathbf{x})\boldsymbol{\beta}\right). \tag{18}$$

Different from traditional learning algorithms [23], ELM
tends to reach not only the smallest training error but also
the smallest norm of output weights. According to Bartlett’s
theory [34], for feedforward neural networks reaching smaller
training error, the smaller the norms of weights are, the better
generalization performance the networks tend to have. We
conjecture that this may also be true of the generalized SLFNs where
the hidden layer may not be neuron alike [15], [16]. ELM is to
minimize the training error as well as the norm of the output
weights [12], [13]

$$\text{Minimize: } \|\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\|^2 \text{ and } \|\boldsymbol{\beta}\| \tag{19}$$

where $\mathbf{H}$ is the hidden-layer output matrix

$$\mathbf{H} = \begin{bmatrix} \mathbf{h}(\mathbf{x}_1) \\ \vdots \\ \mathbf{h}(\mathbf{x}_N) \end{bmatrix} = \begin{bmatrix} h_1(\mathbf{x}_1) & \cdots & h_L(\mathbf{x}_1) \\ \vdots & \ddots & \vdots \\ h_1(\mathbf{x}_N) & \cdots & h_L(\mathbf{x}_N) \end{bmatrix}. \tag{20}$$

Seen from (18), to minimize the norm of the output weights
$\|\boldsymbol{\beta}\|$ is actually to maximize the distance of the separating
margins of the two different classes in the ELM feature space:
$2/\|\boldsymbol{\beta}\|$.

The minimal norm least square method instead of the
standard optimization method was used in the original
implementation of ELM [12], [13]

$$\boldsymbol{\beta} = \mathbf{H}^{\dagger}\mathbf{T} \tag{21}$$

where $\mathbf{H}^{\dagger}$ is the Moore-Penrose generalized inverse of matrix
$\mathbf{H}$ [35], [36]. Different methods can be used to calculate the
Moore-Penrose generalized inverse of a matrix: orthogonal
projection method, orthogonalization method, iterative method,
and singular value decomposition (SVD) [36]. The orthogonal
projection method [36] can be used in two cases: when $\mathbf{H}^T\mathbf{H}$
is nonsingular and $\mathbf{H}^{\dagger} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T$, or when $\mathbf{H}\mathbf{H}^T$ is
nonsingular and $\mathbf{H}^{\dagger} = \mathbf{H}^T(\mathbf{H}\mathbf{H}^T)^{-1}$.

According to the ridge regression theory [37], one can add
a positive value to the diagonal of $\mathbf{H}^T\mathbf{H}$ or $\mathbf{H}\mathbf{H}^T$; the
resultant solution is more stable and tends to have better generalization
performance. Toh [22] and Deng et al. [21] have studied the
performance of ELM with this enhancement under the Sigmoid
additive type of SLFNs. This section extends such study to
generalized SLFNs with a different type of hidden nodes (feature
mappings) as well as kernels.
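
The two regularized forms of the orthogonal projection method give the same operator. The short derivation below is a sketch using only standard matrix algebra; writing the added positive diagonal value as $1/C$ with $C > 0$ is an assumption made here to match the notation used for the user-specified parameter elsewhere in the paper.

$$\mathbf{H}^T\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right) = \frac{\mathbf{H}^T}{C} + \mathbf{H}^T\mathbf{H}\mathbf{H}^T = \left(\frac{\mathbf{I}}{C} + \mathbf{H}^T\mathbf{H}\right)\mathbf{H}^T$$

Both $\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T$ (of size $N \times N$) and $\frac{\mathbf{I}}{C} + \mathbf{H}^T\mathbf{H}$ (of size $L \times L$) are positive definite for $C > 0$ and hence invertible. Multiplying the identity on the left by $\left(\frac{\mathbf{I}}{C} + \mathbf{H}^T\mathbf{H}\right)^{-1}$ and on the right by $\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right)^{-1}$ gives

$$\left(\frac{\mathbf{I}}{C} + \mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T = \mathbf{H}^T\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right)^{-1}.$$

The choice between the two forms is therefore a matter of cost: the left-hand side inverts an $L \times L$ matrix and the right-hand side an $N \times N$ matrix.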

There is a gap between ELM and LS-SVM/PSVM, and it is 
not clear whether there is some relationship between ELM and 
LS-SVM/PSVM. This section aims to fill the gap and build the 
relationship between ELM and LS-SVM/PSVM. 


A. Sufficient and Necessary Conditions for Universal 
Classifiers 

1) Universal Approximation Capability: According to ELM
learning theory, a widespread type of feature mappings $\mathbf{h}(\mathbf{x})$
can be used in ELM so that ELM can approximate any
continuous target functions (refer to [14]-[16] for details). That is,
given any target continuous function $f(\mathbf{x})$, there exists a series
of $\beta_i$’s such that

$$\lim_{L \to +\infty} \left\| f_L(\mathbf{x}) - f(\mathbf{x}) \right\| = \lim_{L \to +\infty} \left\| \sum_{i=1}^{L} \beta_i h_i(\mathbf{x}) - f(\mathbf{x}) \right\| = 0. \tag{22}$$

With this universal approximation capability, the bias $b$ in the
optimization constraints of SVM, LS-SVM, and PSVM can be
removed, and the resultant learning algorithm has milder
optimization constraints. Thus, better generalization performance
and lower computational complexity can be obtained. In SVM,
LS-SVM, and PSVM, as the feature mapping $\phi(\mathbf{x}_i)$ may be
unknown, usually not every feature mapping to be used in
SVM, LS-SVM, and PSVM satisfies the universal approximation
condition. Obviously, a learning machine with a feature
mapping which does not satisfy the universal approximation
condition cannot approximate all target continuous functions.
Thus, the universal approximation condition is not only a
sufficient condition but also a necessary condition for a feature
mapping to be widely used. This is also true for classification
applications.

2) Classification Capability: Similar to the classification 
capability theorem of single-hidden-layer feedforward neural 
networks [38], we can prove the classification capability of 
the generalized SLFNs with the hidden-layer mapping h(x) 
satisfying the universal approximation condition (22). 

Definition 3.1: A closed set is called a region regardless of
whether it is bounded or not.

Lemma 3.1 [38]: Given disjoint regions $K_1, K_2, \ldots, K_m$
in $\mathbf{R}^d$ and the corresponding $m$ arbitrary real values
$c_1, c_2, \ldots, c_m$, and an arbitrary region $X$ disjointed from any
$K_i$, there exists a continuous function $f(\mathbf{x})$ such that $f(\mathbf{x}) = c_i$
if $\mathbf{x} \in K_i$ and $f(\mathbf{x}) = c_0$ if $\mathbf{x} \in X$, where $c_0$ is an arbitrary real
value different from $c_1, c_2, \ldots, c_m$.

The classification capability theorem of Huang et al. [38] can 
be extended to generalized SLFNs which need not be neuron 
alike. 

Theorem 3.1: Given a feature mapping $\mathbf{h}(\mathbf{x})$, if $\mathbf{h}(\mathbf{x})\boldsymbol{\beta}$ is
dense in $C(\mathbf{R}^d)$ or in $C(M)$, where $M$ is a compact set of
$\mathbf{R}^d$, then a generalized SLFN with such a hidden-layer mapping
$\mathbf{h}(\mathbf{x})$ can separate arbitrary disjoint regions of any shapes in $\mathbf{R}^d$
or $M$.

Proof: Given $m$ disjoint regions $K_1, K_2, \ldots, K_m$ in $\mathbf{R}^d$
and their corresponding $m$ labels $c_1, c_2, \ldots, c_m$, according to
Lemma 3.1, there exists a continuous function $f(\mathbf{x})$ in $C(\mathbf{R}^d)$
or on one compact set of $\mathbf{R}^d$ such that $f(\mathbf{x}) = c_i$ if $\mathbf{x} \in K_i$.
Hence, if $\mathbf{h}(\mathbf{x})\boldsymbol{\beta}$ is dense in $C(\mathbf{R}^d)$ or on one compact set
of $\mathbf{R}^d$, then it can approximate the function $f(\mathbf{x})$, and there
exists a corresponding generalized SLFN to implement such
a function $f(\mathbf{x})$. Thus, such a generalized SLFN can separate
these decision regions regardless of shapes of these regions. □

Seen from Theorem 3.1, it is a necessary and sufficient
condition that the feature mapping $\mathbf{h}(\mathbf{x})$ is chosen to make $\mathbf{h}(\mathbf{x})\boldsymbol{\beta}$
have the capability of approximating any target continuous
function. If $\mathbf{h}(\mathbf{x})\boldsymbol{\beta}$ cannot approximate any target continuous
functions, there may exist some shapes of regions which cannot
be separated by a classifier with such feature mapping $\mathbf{h}(\mathbf{x})$.
In other words, as long as the dimensionality of the feature
mapping (number of hidden nodes $L$ in a classifier) is large
enough, the output of the classifier $\mathbf{h}(\mathbf{x})\boldsymbol{\beta}$ can be as close to the
class labels in the corresponding regions as possible.

In the binary classification case, ELM only uses a single-output
node, and the class label closer to the output value of
ELM is chosen as the predicted class label of the input data.
There are two solutions for the multiclass classification case.

1) ELM only uses a single-output node, and among the
multiclass labels, the class label closer to the output value
of ELM is chosen as the predicted class label of the
input data. In this case, the ELM solution to the binary
classification case becomes a specific case of multiclass
solution.

2) ELM uses multioutput nodes, and the index of the output
node with the highest output value is considered as the
label of the input data.

For the sake of readability, these two solutions are analyzed
separately. It can be found that, eventually, the same solution
formula is obtained for both cases.
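
The two prediction rules listed above can be sketched as follows. The function names are hypothetical, and the multioutput rule returns a 0-based index here, whereas the text numbers output nodes from 1.

```python
def predict_single_output(output, class_labels):
    """Single-output rule: choose the class label nearest to the scalar output."""
    return min(class_labels, key=lambda label: abs(label - output))


def predict_multi_output(outputs):
    """Multioutput rule: index (0-based) of the output node with the highest value."""
    return max(range(len(outputs)), key=lambda j: outputs[j])


if __name__ == "__main__":
    # Binary case is the single-output rule with labels {-1, 1}.
    print(predict_single_output(0.3, [-1, 1]))
    # Multiclass case with numeric labels 1, 2, 3 on one output node.
    print(predict_single_output(2.2, [1, 2, 3]))
    # Multiclass case with one output node per class.
    print(predict_multi_output([0.1, 0.7, 0.2]))
```

With labels {-1, 1}, the nearest-label rule coincides with taking the sign of the output, which is why the binary case is a special case of the single-output multiclass rule.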

B. Simplified Constrained-Optimization Problems 

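1) Multiclass Classifier With Single Output: Since ELM can
approximate any target continuous functions and the output of
the ELM classifier $\mathbf{h}(\mathbf{x})\boldsymbol{\beta}$ can be as close to the class labels
in the corresponding regions as possible, the classification
problem for the proposed constrained-optimization-based ELM
with a single-output node can be formulated as

$$\begin{aligned}
\text{Minimize: } & L_{P_{\mathrm{ELM}}} = \frac{1}{2}\|\boldsymbol{\beta}\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N} \xi_i^2 \\
\text{Subject to: } & \mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} = t_i - \xi_i, \quad i = 1, \ldots, N.
\end{aligned} \tag{23}$$

Based on the KKT theorem, to train ELM is equivalent to
solving the following dual optimization problem:

$$L_{D_{\mathrm{ELM}}} = \frac{1}{2}\|\boldsymbol{\beta}\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N} \xi_i^2 - \sum_{i=1}^{N} \alpha_i\left(\mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} - t_i + \xi_i\right) \tag{24}$$

where each Lagrange multiplier $\alpha_i$ corresponds to the $i$th
training sample. We can have the KKT optimality conditions
of (24) as follows:

$$\frac{\partial L_{D_{\mathrm{ELM}}}}{\partial \boldsymbol{\beta}} = 0 \rightarrow \boldsymbol{\beta} = \sum_{i=1}^{N} \alpha_i \mathbf{h}(\mathbf{x}_i)^T = \mathbf{H}^T\boldsymbol{\alpha} \tag{25a}$$

$$\frac{\partial L_{D_{\mathrm{ELM}}}}{\partial \xi_i} = 0 \rightarrow \alpha_i = C\xi_i, \quad i = 1, \ldots, N \tag{25b}$$

$$\frac{\partial L_{D_{\mathrm{ELM}}}}{\partial \alpha_i} = 0 \rightarrow \mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} - t_i + \xi_i = 0, \quad i = 1, \ldots, N \tag{25c}$$

where $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_N]^T$.

2) Multiclass Classifier With Multioutputs: An alternative
approach for multiclass applications is to let ELM have
multioutput nodes instead of a single-output node; $m$-class
classifiers have $m$ output nodes. If the original class label is
$p$, the expected output vector of the $m$ output nodes is
$\mathbf{t}_i = [0, \ldots, 0, 1, 0, \ldots, 0]^T$, with the 1 in the $p$th position.
In this case, only the $p$th element of
$\mathbf{t}_i = [t_{i,1}, \ldots, t_{i,m}]^T$ is one, while the rest of the elements are
set to zero. The classification problem for ELM with
multioutput nodes can be formulated as

$$\begin{aligned}
\text{Minimize: } & L_{P_{\mathrm{ELM}}} = \frac{1}{2}\|\boldsymbol{\beta}\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N} \|\boldsymbol{\xi}_i\|^2 \\
\text{Subject to: } & \mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} = \mathbf{t}_i^T - \boldsymbol{\xi}_i^T, \quad i = 1, \ldots, N
\end{aligned} \tag{26}$$

where $\boldsymbol{\xi}_i = [\xi_{i,1}, \ldots, \xi_{i,m}]^T$ is the training error vector of the
$m$ output nodes with respect to the training sample $\mathbf{x}_i$. Based
on the KKT theorem, to train ELM is equivalent to solving the
following dual optimization problem:

$$L_{D_{\mathrm{ELM}}} = \frac{1}{2}\|\boldsymbol{\beta}\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N} \|\boldsymbol{\xi}_i\|^2 - \sum_{i=1}^{N}\sum_{j=1}^{m} \alpha_{i,j}\left(\mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta}_j - t_{i,j} + \xi_{i,j}\right) \tag{27}$$

where $\boldsymbol{\beta}_j$ is the vector of the weights linking hidden layer to the
$j$th output node and $\boldsymbol{\beta} = [\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_m]$. We can have the
corresponding KKT optimality conditions as follows:

$$\frac{\partial L_{D_{\mathrm{ELM}}}}{\partial \boldsymbol{\beta}_j} = 0 \rightarrow \boldsymbol{\beta}_j = \sum_{i=1}^{N} \alpha_{i,j} \mathbf{h}(\mathbf{x}_i)^T \rightarrow \boldsymbol{\beta} = \mathbf{H}^T\boldsymbol{\alpha} \tag{28a}$$

$$\frac{\partial L_{D_{\mathrm{ELM}}}}{\partial \boldsymbol{\xi}_i} = 0 \rightarrow \boldsymbol{\alpha}_i = C\boldsymbol{\xi}_i, \quad i = 1, \ldots, N \tag{28b}$$

$$\frac{\partial L_{D_{\mathrm{ELM}}}}{\partial \boldsymbol{\alpha}_i} = 0 \rightarrow \mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} - \mathbf{t}_i^T + \boldsymbol{\xi}_i^T = 0, \quad i = 1, \ldots, N \tag{28c}$$

where $\boldsymbol{\alpha}_i = [\alpha_{i,1}, \ldots, \alpha_{i,m}]^T$ and $\boldsymbol{\alpha} = [\boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_N]^T$.

It can be seen from (24), (25a)-(25c), (27), and (28a)-(28c)
that the single-output node case can be considered a specific case of
multioutput nodes when the number of output nodes is set to
one: $m = 1$. Thus, we only need to consider the multiclass
classifier with multioutput nodes. For both cases, the
hidden-layer matrix $\mathbf{H}$ (20) remains the same, and the size of $\mathbf{H}$ is only
decided by the number of training samples $N$ and the number
of hidden nodes $L$, which is irrelevant to the number of output
nodes (number of classes).


C. Equality Constrained-Optimization-Based ELM

Different solutions to the aforementioned KKT conditions
can be obtained depending on efficiency concerns for
different sizes of training data sets.

1) For the Case Where the Number of Training Samples is
Not Huge: In this case, by substituting (28a) and (28b) into
(28c), the aforementioned equations can be equivalently written
as

$$\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right)\boldsymbol{\alpha} = \mathbf{T} \tag{29}$$

where

$$\mathbf{T} = \begin{bmatrix} \mathbf{t}_1^T \\ \vdots \\ \mathbf{t}_N^T \end{bmatrix} = \begin{bmatrix} t_{11} & \cdots & t_{1m} \\ \vdots & \ddots & \vdots \\ t_{N1} & \cdots & t_{Nm} \end{bmatrix}. \tag{30}$$

From (28a) and (29), we have

$$\boldsymbol{\beta} = \mathbf{H}^T\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right)^{-1}\mathbf{T}. \tag{31}$$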
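
The step from the KKT conditions (28a)-(28c) to (29) and (31) can be written out explicitly. The following is a sketch of that substitution; the symbol $\boldsymbol{\Xi}$ for the stacked error matrix is introduced here only for the derivation and does not appear in the original text.

Stacking (28c) over $i = 1, \ldots, N$, with $\boldsymbol{\Xi} = [\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_N]^T$, gives

$$\mathbf{H}\boldsymbol{\beta} - \mathbf{T} + \boldsymbol{\Xi} = 0.$$

Condition (28b) gives $\boldsymbol{\Xi} = \boldsymbol{\alpha}/C$, and condition (28a) gives $\boldsymbol{\beta} = \mathbf{H}^T\boldsymbol{\alpha}$, so that

$$\mathbf{H}\mathbf{H}^T\boldsymbol{\alpha} - \mathbf{T} + \frac{\boldsymbol{\alpha}}{C} = 0 \;\Longrightarrow\; \left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right)\boldsymbol{\alpha} = \mathbf{T}.$$

Since $\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T$ is positive definite for $C > 0$, it can be inverted, and substituting $\boldsymbol{\alpha}$ back into (28a) yields

$$\boldsymbol{\beta} = \mathbf{H}^T\boldsymbol{\alpha} = \mathbf{H}^T\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right)^{-1}\mathbf{T}.$$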
The output function of ELM classifier is

$$\mathbf{f}(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta} = \mathbf{h}(\mathbf{x})\mathbf{H}^T\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right)^{-1}\mathbf{T}. \tag{32}$$

1) Single-output node ($m = 1$): For multiclass
classifications, among all the multiclass labels, the predicted class
label of a given testing sample is the one closest to the output of
ELM classifier. For binary classification case, ELM needs
only one output node ($m = 1$), and the decision function
of ELM classifier is

$$f(\mathbf{x}) = \operatorname{sign}\left(\mathbf{h}(\mathbf{x})\mathbf{H}^T\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right)^{-1}\mathbf{T}\right). \tag{33}$$

2) Multioutput nodes ($m > 1$): For multiclass cases, the
predicted class label of a given testing sample is the index
number of the output node which has the highest output
value for the given testing sample. Let $f_j(\mathbf{x})$ denote
the output function of the $j$th output node, i.e.,
$\mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x}), \ldots, f_m(\mathbf{x})]^T$; then, the predicted class label of
sample $\mathbf{x}$ is

$$\operatorname{label}(\mathbf{x}) = \arg\max_{j \in \{1, \ldots, m\}} f_j(\mathbf{x}). \tag{34}$$

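
Equations (29), (31), and (34) translate into a short training and prediction routine. The following Python sketch is illustrative and rests on several assumptions: the helper names are hypothetical, the hidden layer uses random sigmoid nodes with parameters drawn uniformly from [-1, 1], and the values of `L` and `C` are arbitrary. The small Gaussian elimination solver stands in for whatever linear algebra routine an actual implementation would use.

```python
import math
import random


def solve_multi(A, B):
    """Solve A X = B (B may have several columns) by Gaussian elimination."""
    n, k = len(A), len(B[0])
    M = [A[i][:] + B[i][:] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + k):
                M[r][c] -= factor * M[col][c]
    X = [[0.0] * k for _ in range(n)]
    for i in reversed(range(n)):
        for c in range(k):
            acc = M[i][n + c] - sum(M[i][j] * X[j][c] for j in range(i + 1, n))
            X[i][c] = acc / M[i][i]
    return X


def random_sigmoid_nodes(d, L, seed=0):
    """Draw L hidden-node parameters (a, b); fixed after generation, never tuned."""
    rng = random.Random(seed)
    return [([rng.uniform(-1, 1) for _ in range(d)], rng.uniform(-1, 1))
            for _ in range(L)]


def hidden_layer(x, nodes):
    """Row vector h(x): one sigmoid output per hidden node."""
    return [1.0 / (1.0 + math.exp(-(sum(ai * xi for ai, xi in zip(a, x)) + b)))
            for a, b in nodes]


def elm_output_weights(H, T, C):
    """Return (beta, alpha) with (I/C + H H^T) alpha = T (29), beta = H^T alpha (31)."""
    N, L, m = len(H), len(H[0]), len(T[0])
    G = [[sum(H[i][k] * H[j][k] for k in range(L)) + (1.0 / C if i == j else 0.0)
          for j in range(N)] for i in range(N)]
    alpha = solve_multi(G, T)
    beta = [[sum(H[i][k] * alpha[i][j] for i in range(N)) for j in range(m)]
            for k in range(L)]
    return beta, alpha


def predict_label(h, beta):
    """Rule (34): index (0-based) of the output node with the highest value."""
    outputs = [sum(h[k] * beta[k][j] for k in range(len(h)))
               for j in range(len(beta[0]))]
    return max(range(len(outputs)), key=lambda j: outputs[j])


if __name__ == "__main__":
    X = [[0.0, 0.0], [0.2, 0.1], [4.0, 0.0], [3.8, 0.3], [0.0, 4.0], [0.1, 3.9]]
    labels = [0, 0, 1, 1, 2, 2]
    nodes = random_sigmoid_nodes(d=2, L=20, seed=1)
    H = [hidden_layer(x, nodes) for x in X]
    T = [[1.0 if j == y else 0.0 for j in range(3)] for y in labels]  # one-hot rows
    beta, _ = elm_output_weights(H, T, C=100.0)
    print([predict_label(h, beta) for h in H])
```

Only one linear system is solved regardless of the number of classes: the matrix `G` depends on `N` and `L` but not on `m`, and the same routine covers regression, binary, and multiclass targets by changing the columns of `T`.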


2) For the Case Where the Number of Training Samples is Huge: If the number of training data is very large, for example, much larger than the dimensionality of the feature space ($N \gg L$), we have an alternative solution. From (28a) and (28b), we have

$$\boldsymbol{\beta} = C\mathbf{H}^T\boldsymbol{\xi} \tag{35}$$

$$\boldsymbol{\xi} = \frac{1}{C}\left(\mathbf{H}^T\right)^{\dagger}\boldsymbol{\beta}. \tag{36}$$

From (28c), we have

$$\begin{aligned} \mathbf{H}\boldsymbol{\beta} - \mathbf{T} + \frac{1}{C}\left(\mathbf{H}^T\right)^{\dagger}\boldsymbol{\beta} &= \mathbf{0} \\ \mathbf{H}^T\left(\mathbf{H} + \frac{1}{C}\left(\mathbf{H}^T\right)^{\dagger}\right)\boldsymbol{\beta} &= \mathbf{H}^T\mathbf{T} \\ \boldsymbol{\beta} &= \left(\frac{\mathbf{I}}{C} + \mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{T}. \end{aligned} \tag{37}$$

In this case, the output function of ELM classifier is

$$\mathbf{f}(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta} = \mathbf{h}(\mathbf{x})\left(\frac{\mathbf{I}}{C} + \mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{T}. \tag{38}$$

1) Single-output node ($m = 1$): For multiclass classifications, the predicted class label of a given testing sample is the class label closest to the output value of the ELM classifier. For the binary classification case, the decision function of the ELM classifier is

$$f(\mathbf{x}) = \operatorname{sign}\left(\mathbf{h}(\mathbf{x})\left(\frac{\mathbf{I}}{C} + \mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{T}\right). \tag{39}$$

2) Multioutput nodes ($m > 1$): The predicted class label of a given testing sample is the index of the output node which has the highest output.

Remark: Although the alternative approaches for different sizes of data sets are discussed and provided separately, in theory, there is no specific requirement on the size of the training data sets in any of the approaches [see (32) and (38)], and all the approaches can be used in applications of any size. However, different approaches have different computational costs, and their efficiency may differ across applications. In the implementation of ELM, it is found that the generalization performance of ELM is not sensitive to the dimensionality of the feature space ($L$), and good performance can be reached as long as $L$ is large enough. In our simulations, $L = 1000$ is set for all tested cases regardless of the size of the training data sets. Thus, if the training data sets are very large ($N \gg L$), one may prefer to apply solution (38) in order to reduce computational costs. However, if a feature mapping $\mathbf{h}(\mathbf{x})$ is unknown, one may prefer to use solution (32) instead (which will be discussed later in Section IV).

IV. Discussions

A. Random Feature Mappings and Kernels

1) Random Feature Mappings: Different from SVM, LS-SVM, and PSVM, in ELM, a feature mapping (hidden-layer output vector) $\mathbf{h}(\mathbf{x}) = [h_1(\mathbf{x}), \ldots, h_L(\mathbf{x})]$ is usually known to users. According to [15] and [16], almost all nonlinear piecewise continuous functions can be used as the hidden-node output functions, and thus, the feature mappings used in ELM can be very diversified.

For example, as mentioned in [29], we can have

$$\mathbf{h}(\mathbf{x}) = [G(\mathbf{a}_1, b_1, \mathbf{x}), \ldots, G(\mathbf{a}_L, b_L, \mathbf{x})] \tag{40}$$

where $G(\mathbf{a}, b, \mathbf{x})$ is a nonlinear piecewise continuous function satisfying ELM universal approximation capability theorems [14]–[16] and $\{(\mathbf{a}_i, b_i)\}_{i=1}^{L}$ are randomly generated according to any continuous probability distribution. For example, such nonlinear piecewise continuous functions can be as follows.

1) Sigmoid function

$$G(\mathbf{a}, b, \mathbf{x}) = \frac{1}{1 + \exp(-(\mathbf{a} \cdot \mathbf{x} + b))}. \tag{41}$$

2) Hard-limit function

$$G(\mathbf{a}, b, \mathbf{x}) = \begin{cases} 1, & \text{if } \mathbf{a} \cdot \mathbf{x} - b \geq 0 \\ 0, & \text{otherwise.} \end{cases} \tag{42}$$

3) Gaussian function

$$G(\mathbf{a}, b, \mathbf{x}) = \exp\left(-b\|\mathbf{x} - \mathbf{a}\|^2\right). \tag{43}$$

4) Multiquadric function

$$G(\mathbf{a}, b, \mathbf{x}) = \left(\|\mathbf{x} - \mathbf{a}\|^2 + b^2\right)^{1/2}. \tag{44}$$

Sigmoid and Gaussian functions are two of the major hidden-layer output functions used in the feedforward neural networks and RBF networks², respectively. Interestingly, ELM with hard-limit [24] and multiquadric functions can have good generalization performance as well.

² Readers can refer to [39] for the difference between ELM and RBF networks.
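As an illustration of the random feature mapping (40) with the node functions (41)–(44), the following NumPy sketch builds the hidden-layer output matrix for a batch of inputs. The function names and the uniform sampling ranges for $(\mathbf{a}_i, b_i)$ are assumptions made for this example; the text only requires a continuous probability distribution.

```python
import numpy as np

def sigmoid_nodes(X, A, b):
    # (41): G(a, b, x) = 1 / (1 + exp(-(a . x + b)))
    return 1.0 / (1.0 + np.exp(-(X @ A.T + b)))

def hardlimit_nodes(X, A, b):
    # (42): 1 if a . x - b >= 0, otherwise 0
    return (X @ A.T - b >= 0).astype(float)

def sq_dist(X, A):
    # Squared Euclidean distances ||x - a||^2 for all pairs, shape (N, L).
    return ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)

def gaussian_nodes(X, A, b):
    # (43): exp(-b ||x - a||^2)
    return np.exp(-b * sq_dist(X, A))

def multiquadric_nodes(X, A, b):
    # (44): (||x - a||^2 + b^2)^(1/2)
    return np.sqrt(sq_dist(X, A) + b ** 2)

def random_feature_map(X, L, node_fn, rng):
    """Return the N x L matrix whose rows are h(x) as in (40)."""
    A = rng.uniform(-1.0, 1.0, size=(L, X.shape[1]))  # assumed range for a_i
    b = rng.uniform(0.0, 1.0, size=L)                 # assumed range for b_i
    return node_fn(X, A, b)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(5, 3))
H_sig = random_feature_map(X, 7, sigmoid_nodes, rng)
H_hard = random_feature_map(X, 7, hardlimit_nodes, rng)
H_gauss = random_feature_map(X, 7, gaussian_nodes, rng)
H_mq = random_feature_map(X, 7, multiquadric_nodes, rng)
```

None of the parameters $(\mathbf{a}_i, b_i)$ is tuned after sampling; only the output weights are computed from the resulting matrix.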

Suykens and Vandewalle [30] described a training method for SLFNs which applies the hidden-layer output mapping as the feature mapping of SVM. However, different from ELM where the hidden layer is not parametric and need not be tuned, the feature mapping of their SVM implementation is parametric, and the hidden-layer parameters need to be iteratively computed by solving an optimization problem. Their learning algorithm can be summarized as follows:

$$\text{minimize: } \|r\mathbf{w}\|^2 \tag{45}$$

subject to:

$$\begin{aligned} \text{C1: } & \text{QP subproblem:} \\ & \mathbf{w} = \sum_{i=1}^{N}\alpha_i^{*} t_i \tanh(\mathbf{V}\mathbf{x}_i + \mathbf{B}) \\ & \alpha_i^{*} = \arg\max_{\alpha_i} Q\left(\alpha_i; \tanh(\mathbf{V}\mathbf{x}_i + \mathbf{B})\right) \\ & 0 \leq \alpha_i^{*} \leq c \\ \text{C2: } & \|\mathbf{V}(:); \mathbf{B}\|_2 \leq \gamma \\ \text{C3: } & r \text{ is the radius of the smallest ball containing } \{\tanh(\mathbf{V}\mathbf{x}_i) + \mathbf{B}\}_{i=1}^{N} \end{aligned} \tag{46}$$

where $\mathbf{V}$ denotes the interconnection matrix for the hidden layer, $\mathbf{B}$ is the bias vector, $(:)$ is a columnwise scan of the interconnection matrix for the hidden layer, and $\gamma$ is a positive constant. In addition, $Q$ is the cost function of the corresponding SVM dual problem

$$\max_{\alpha_i} Q\left(\alpha_i; K(\mathbf{x}_i, \mathbf{x}_j)\right) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) + \sum_{i=1}^{N}\alpha_i. \tag{47}$$

QP subproblems need to be solved for hidden-node parameters $\mathbf{V}$ and $\mathbf{B}$, while the hidden-node parameters of ELM are randomly generated and known to users.

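2) Kernels: If a feature mapping $\mathbf{h}(\mathbf{x})$ is unknown to users, one can apply Mercer's conditions on ELM. We can define a kernel matrix for ELM as follows:

$$\boldsymbol{\Omega}_{\text{ELM}} = \mathbf{H}\mathbf{H}^T : \quad \Omega_{\text{ELM}\,i,j} = \mathbf{h}(\mathbf{x}_i) \cdot \mathbf{h}(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j). \tag{48}$$

Then, the output function of ELM classifier (32) can be written compactly as

$$\mathbf{f}(\mathbf{x}) = \mathbf{h}(\mathbf{x})\mathbf{H}^T\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right)^{-1}\mathbf{T} = \begin{bmatrix} K(\mathbf{x}, \mathbf{x}_1) \\ \vdots \\ K(\mathbf{x}, \mathbf{x}_N) \end{bmatrix}^T\left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}_{\text{ELM}}\right)^{-1}\mathbf{T}. \tag{49}$$

In this specific case, similar to SVM, LS-SVM, and PSVM, the feature mapping $\mathbf{h}(\mathbf{x})$ need not be known to users; instead, its corresponding kernel $K(\mathbf{u}, \mathbf{v})$ (e.g., $K(\mathbf{u}, \mathbf{v}) = \exp(-\gamma\|\mathbf{u} - \mathbf{v}\|^2)$) is given to users. The dimensionality $L$ of the feature space (number of hidden nodes) need not be given either.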
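A minimal sketch of the kernel form (48) and (49) with the Gaussian kernel is given below. The function names, the toy regression target, and the chosen values of $C$ and $\gamma$ are hypothetical and serve only to show how the kernel matrix replaces $\mathbf{H}\mathbf{H}^T$.

```python
import numpy as np

def gaussian_kernel(U, V, gamma):
    """K(u, v) = exp(-gamma ||u - v||^2) for all pairs of rows of U and V."""
    d2 = (U ** 2).sum(axis=1)[:, None] + (V ** 2).sum(axis=1)[None, :] - 2.0 * U @ V.T
    return np.exp(-gamma * np.maximum(d2, 0.0))  # clip tiny negative round-off

def kernel_elm_fit(X, T, C, gamma):
    """Solve (I/C + Omega_ELM) coef = T, with Omega_ELM as in (48)."""
    omega = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(np.eye(X.shape[0]) / C + omega, T)

def kernel_elm_output(X_new, X, coef, gamma):
    """Evaluate (49): [K(x, x_1), ..., K(x, x_N)] (I/C + Omega_ELM)^{-1} T."""
    return gaussian_kernel(X_new, X, gamma) @ coef

# Toy regression problem (hypothetical): fit sin(x) on 10 points in [0, 3].
X = np.linspace(0.0, 3.0, 10)[:, None]
T = np.sin(X)
coef = kernel_elm_fit(X, T, C=2.0 ** 15, gamma=10.0)
fitted = kernel_elm_output(X, X, coef, gamma=10.0)
```

Note that neither $\mathbf{h}(\mathbf{x})$ nor $L$ appears anywhere in the sketch; only the kernel and the training inputs are needed.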



3) Feature Mapping Matrix: In ELM, $\mathbf{H} = [\mathbf{h}(\mathbf{x}_1)^T, \ldots, \mathbf{h}(\mathbf{x}_N)^T]^T$ is called the hidden-layer output matrix (or feature mapping matrix) because it represents the hidden-layer outputs of the given $N$ training samples. $\mathbf{h}(\mathbf{x}_i)$ denotes the output of the hidden layer with regard to the input sample $\mathbf{x}_i$. Feature mapping $\mathbf{h}(\mathbf{x}_i)$ maps the data $\mathbf{x}_i$ from the input space to the hidden-layer feature space, and the feature mapping matrix $\mathbf{H}$ is independent of the targets $\mathbf{t}_i$. Given the nature of a feature mapping, it is reasonable for the feature mapping matrix to be independent of the target values $\mathbf{t}_i$'s. However, in both LS-SVM and PSVM, the feature mapping matrix $\mathbf{Z} = [t_1\phi(\mathbf{x}_1)^T, \ldots, t_N\phi(\mathbf{x}_N)^T]^T$ (12) is designed to depend on the targets $t_i$'s of the training samples $\mathbf{x}_i$'s.


B. ELM: Unified Learning Mode for Regression, Binary, and 
Multiclass Classification 

As observed from (32) and (38), ELM has unified solutions for regression, binary, and multiclass classification. The kernel matrix $\boldsymbol{\Omega}_{\text{ELM}} = \mathbf{H}\mathbf{H}^T$ is only related to the input data $\mathbf{x}_i$ and the number of training samples. The kernel matrix $\boldsymbol{\Omega}_{\text{ELM}}$ is relevant neither to the number of output nodes $m$ nor to the training target values $\mathbf{t}_i$'s. However, in multiclass LS-SVM, aside from the input data $\mathbf{x}_i$, the kernel matrix $\boldsymbol{\Omega}_M$ (52) also depends on the number of output nodes $m$ and the training target values $\mathbf{t}_i$'s.

For the multiclass case with $m$ labels, LS-SVM uses $m$ output nodes in order to encode multiclasses, where $t_{i,j}$ denotes the output value of the $j$th output node for the training data $\mathbf{x}_i$ [10]. The $m$ outputs can be used to encode up to $2^m$ different classes. For the multiclass case, the primal optimization problem of LS-SVM can be given as [10]

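$$\begin{aligned} \text{Minimize: } & L_{P_{\text{LS-SVM}}}^{(m)} = \frac{1}{2}\sum_{j=1}^{m}\mathbf{w}_j \cdot \mathbf{w}_j + C\,\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{m}\xi_{i,j}^2 \\ \text{Subject to: } & \begin{cases} t_{i,1}\left(\mathbf{w}_1 \cdot \phi_1(\mathbf{x}_i) + b_1\right) = 1 - \xi_{i,1} \\ t_{i,2}\left(\mathbf{w}_2 \cdot \phi_2(\mathbf{x}_i) + b_2\right) = 1 - \xi_{i,2} \\ \quad\vdots \\ t_{i,m}\left(\mathbf{w}_m \cdot \phi_m(\mathbf{x}_i) + b_m\right) = 1 - \xi_{i,m} \end{cases} \quad i = 1, \ldots, N. \end{aligned} \tag{50}$$

Similar to the LS-SVM solution (11) to the binary classification, with KKT conditions, the corresponding LS-SVM solution for multiclass cases can be obtained as follows:

$$\begin{bmatrix} \mathbf{0} & \mathbf{T}^T \\ \mathbf{T} & \boldsymbol{\Omega}_M \end{bmatrix}\begin{bmatrix} \mathbf{b}_M \\ \boldsymbol{\alpha}_M \end{bmatrix} = \begin{bmatrix} \mathbf{0} \\ \vec{\mathbf{1}} \end{bmatrix} \tag{51}$$

$$\begin{aligned} \boldsymbol{\Omega}_M &= \operatorname{blockdiag}\left\{\boldsymbol{\Omega}^{(1)} + \frac{\mathbf{I}}{C}, \ldots, \boldsymbol{\Omega}^{(m)} + \frac{\mathbf{I}}{C}\right\} \\ \Omega^{(j)}_{kl} &= t_{k,j}\, t_{l,j}\, K^{(j)}(\mathbf{x}_k, \mathbf{x}_l) \\ \mathbf{b}_M &= [b_1, \ldots, b_m] \\ \boldsymbol{\alpha}_M &= [\alpha_{1,1}, \ldots, \alpha_{N,1}, \ldots, \alpha_{1,m}, \ldots, \alpha_{N,m}] \end{aligned} \tag{52}$$

$$K^{(j)}(\mathbf{x}_k, \mathbf{x}_l) = \phi_j(\mathbf{x}_k) \cdot \phi_j(\mathbf{x}_l) = \exp\left(-\frac{\|\mathbf{x}_k - \mathbf{x}_l\|^2}{\sigma_j^2}\right), \quad j = 1, \ldots, m. \tag{53}$$

Fig. 1. Scalability of different classifiers: An example on Letter data set. The training time spent by LS-SVM and ELM (Gaussian kernel) increases sharply when the number of training data increases. However, the training time spent by ELM with Sigmoid additive node and multiquadric function node increases very slowly when the number of training data increases.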


As seen from (52), the multiclass LS-SVM actually uses $m$ binary-class LS-SVMs concurrently for $m$ class labels; each of the $m$ binary-class LS-SVMs may have a different kernel matrix $\boldsymbol{\Omega}^{(j)}$, $j = 1, \ldots, m$. In contrast, ELM always has one hidden layer linking to all the $m$ output nodes. In multiclass LS-SVM, different kernels may be used in each individual binary LS-SVM, and the $j$th LS-SVM uses kernel $K^{(j)}(\mathbf{u}, \mathbf{v})$. Taking the Gaussian kernel as an example, $K^{(j)}(\mathbf{u}, \mathbf{v}) = \exp(-\|\mathbf{u} - \mathbf{v}\|^2 / \sigma_j^2)$; from a practical point of view, it may be time consuming and tedious for users to choose different kernel parameters $\sigma_j$, and thus, one may set a common value of $\sigma_j$ for all the kernels. In multiclass LS-SVM, the size of $\boldsymbol{\Omega}_M$ is $Nm \times Nm$, which is related to the number of output nodes $m$. However, in ELM, the size of the kernel matrix $\boldsymbol{\Omega}_{\text{ELM}} = \mathbf{H}\mathbf{H}^T$ is $N \times N$, which is fixed for all the regression, binary, and multiclass classification cases.


C. Computational Complexity and Scalability 

For LS-SVM and PSVM, the main computational cost comes from calculating the Lagrange multipliers $\alpha$'s based on (11) and (16). ELM computes $\boldsymbol{\alpha}$ based on a simpler method (29). More importantly, in large-scale applications, instead of $\mathbf{H}\mathbf{H}^T$ (size: $N \times N$), ELM can get a solution based on (37), where $\mathbf{H}^T\mathbf{H}$ (size: $L \times L$) is used. As in most applications the number of hidden nodes $L$ can be much smaller than the number of training samples ($L \ll N$), the computational cost reduces dramatically. Compared with LS-SVM and PSVM, which use $\mathbf{H}\mathbf{H}^T$ (size: $N \times N$), ELM has much better computational scalability with regard to the number of training samples $N$ (cf. Fig. 1 for an example).

In order to reduce the computational cost of LS-SVM in large-scale problems, fixed-size LS-SVM has been proposed by Suykens et al. [40]–[44]. Fixed-size LS-SVM uses an $M$-sample subset of the original training data set ($M \ll N$) to compute a finite dimensional approximation $\hat{\phi}(\mathbf{x})$ to the feature map $\phi(\mathbf{x})$. However, different from the fixed-size LS-SVM, if $L \ll N$, the $L \times L$ solution of ELM still uses the entire $N$ training samples. In any case, the feature map $\mathbf{h}(\mathbf{x})$ of ELM is not approximated. In fact, the feature map $\mathbf{h}(\mathbf{x})$ of ELM is randomly generated and independent of the training samples (if random hidden nodes are used). The kernel matrix of the fixed-size LS-SVM is built with the subset of size $M \ll N$, while the kernel matrix of ELM is built with the entire data set of size $N$ in all cases.
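The trade-off between the $N \times N$ system of (31) and the $L \times L$ system of (37) can be sketched as follows. The helper name `elm_beta` and the rule of simply picking the smaller system are assumptions of this illustration; both branches return the same output weights up to round-off, so the choice only affects cost.

```python
import numpy as np

def elm_beta(H, T, C):
    """Compute the ELM output weights using the smaller linear system."""
    N, L = H.shape
    if N <= L:
        # N x N system, as in (31).
        return H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    # L x L system, as in (37); cheaper when L << N.
    return np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)

# Toy sizes (hypothetical): N = 200 samples, L = 30 hidden nodes, m = 2 outputs.
rng = np.random.default_rng(1)
H = rng.standard_normal((200, 30))
T = rng.standard_normal((200, 2))
C = 4.0

beta_small = elm_beta(H, T, C)  # solves a 30 x 30 system
beta_large = H.T @ np.linalg.solve(np.eye(200) / C + H @ H.T, T)  # 200 x 200 system
```

The agreement of the two results follows from the identity $\mathbf{H}^T(\mathbf{I}/C + \mathbf{H}\mathbf{H}^T)^{-1} = (\mathbf{I}/C + \mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T$, which holds for any $C > 0$.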


D. Difference From Other Regularized ELMs 


Toh [22] and Deng et al. [21] proposed two different types of 
weighted regularized ELMs. 

The total error rate (TER) ELM [22] uses $m$ output nodes for $m$-class classification applications. In TER-ELM, the counting cost function adopts a quadratic approximation. The OAA method is used in the implementation of TER-ELM in multiclass classification applications. Essentially, TER-ELM consists of $m$ binary TER-ELMs, where the $j$th TER-ELM is trained with all of the samples in the $j$th class with positive labels and all the other examples from the remaining $m - 1$ classes with negative labels. Suppose that there are $m_j^+$ positive category patterns and $m_j^-$ negative category patterns in the $j$th binary TER-ELM. We have a positive output $\mathbf{y}_j^+ = (\tau + \eta)\mathbf{1}^+$ for the $j$th class of samples and a negative class output $\mathbf{y}_j^- = (\tau - \eta)\mathbf{1}^-$ for all the non-$j$th class of samples, where $\mathbf{1}^+ = [1, \ldots, 1]^T \in \mathbf{R}^{m_j^+}$ and $\mathbf{1}^- = [1, \ldots, 1]^T \in \mathbf{R}^{m_j^-}$. A common setting for threshold ($\tau$) and bias ($\eta$) will be set for all the $m$ outputs. The output weight vector $\boldsymbol{\beta}_j$ in the $j$th binary TER-ELM is calculated as

$$\boldsymbol{\beta}_j = \left(\frac{1}{m_j^-}\mathbf{H}_j^{-T}\mathbf{H}_j^- + \frac{1}{m_j^+}\mathbf{H}_j^{+T}\mathbf{H}_j^+\right)^{-1}\left(\frac{1}{m_j^-}\mathbf{H}_j^{-T}\mathbf{y}_j^- + \frac{1}{m_j^+}\mathbf{H}_j^{+T}\mathbf{y}_j^+\right) \tag{54}$$

where $\mathbf{H}_j^+$ and $\mathbf{H}_j^-$ denote the hidden-layer matrices of the $j$th binary TER-ELM corresponding to the positive and negative samples, respectively.

By defining two class-specific diagonal weighting matrices $\mathbf{W}_j^+ = \operatorname{diag}(0, \ldots, 0, 1/m_j^+, \ldots, 1/m_j^+)$ and $\mathbf{W}_j^- = \operatorname{diag}(1/m_j^-, \ldots, 1/m_j^-, 0, \ldots, 0)$, the solution formula (54) of TER-ELM can be written as

$$\boldsymbol{\beta}_j = \left(\frac{\mathbf{I}}{C} + \mathbf{H}_j^T\mathbf{W}_j\mathbf{H}_j\right)^{-1}\mathbf{H}_j^T\mathbf{W}_j\mathbf{y}_j \tag{55}$$

where $\mathbf{W}_j = \mathbf{W}_j^- + \mathbf{W}_j^+$ and the elements of $\mathbf{H}_j$ and $\mathbf{y}_j$ are ordered according to the positive and negative samples of the two classes ($j$th class samples and all the non-$j$th class samples). In order to improve the stability of the learning, $\mathbf{I}/C$ is introduced in the aforementioned formula. If the dimensionality of the hidden layer is much larger than the number of the training data ($L \gg N$), an alternative solution suggested in [22] is

$$\boldsymbol{\beta}_j = \mathbf{H}_j^T\left(\frac{\mathbf{I}}{C} + \mathbf{W}_j\mathbf{H}_j\mathbf{H}_j^T\right)^{-1}\mathbf{W}_j\mathbf{y}_j. \tag{56}$$

Kernels and generalized feature mappings are not considered in TER-ELM.
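To make the weighted formulas (55) and (56) concrete, the sketch below evaluates both for one binary subproblem and lets one compare them numerically. It is only an illustration under stated assumptions: the function names, the toy matrix sizes, and the values of $\tau$, $\eta$, and $C$ are hypothetical, and the samples are ordered with the negative class first so that the diagonal of $\mathbf{W}_j$ matches the definitions of $\mathbf{W}_j^-$ and $\mathbf{W}_j^+$ above.

```python
import numpy as np

def ter_elm_weights(H, y, w, C):
    """(55): beta = (I/C + H^T W H)^{-1} H^T W y, with W = diag(w)."""
    L = H.shape[1]
    HtW = H.T * w  # scales column i of H^T by w_i, i.e., H^T W
    return np.linalg.solve(np.eye(L) / C + HtW @ H, HtW @ y)

def ter_elm_weights_alt(H, y, w, C):
    """(56): beta = H^T (I/C + W H H^T)^{-1} W y."""
    N = H.shape[0]
    WHHt = w[:, None] * (H @ H.T)  # scales row i of H H^T by w_i, i.e., W H H^T
    return H.T @ np.linalg.solve(np.eye(N) / C + WHHt, w * y)

# Toy subproblem (hypothetical): 6 negative then 4 positive samples, L = 5.
rng = np.random.default_rng(2)
m_neg, m_pos = 6, 4
H = rng.standard_normal((m_neg + m_pos, 5))
tau, eta = 0.5, 0.5  # assumed threshold and bias
y = np.concatenate([(tau - eta) * np.ones(m_neg), (tau + eta) * np.ones(m_pos)])
w = np.concatenate([np.full(m_neg, 1.0 / m_neg), np.full(m_pos, 1.0 / m_pos)])

beta_55 = ter_elm_weights(H, y, w, C=8.0)
beta_56 = ter_elm_weights_alt(H, y, w, C=8.0)
```

The two forms differ only in the size of the linear system that is solved ($L \times L$ versus $N \times N$).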









IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 42, NO. 2, APRIL 2012 


Deng et al. [21] mainly focus on the case where $L < N$, and (37) of ELM and the solution formula of Deng et al. [21] look similar to each other. However, different from the ELM solutions provided in this paper, Deng et al. [21] do not consider kernels and generalized feature mappings in their weighted regularized ELM. In the proposed solutions of ELM, the $L$ hidden nodes may have different types of hidden-node output function $h_i(\mathbf{x})$: $\mathbf{h}(\mathbf{x}) = [h_1(\mathbf{x}), \ldots, h_L(\mathbf{x})]$, while in [21], all the hidden nodes use the Sigmoid type of activation functions. Deng et al. [21] do not handle the alternative solution (31).

As seen from (37), the multivariate polynomial model [45] can be considered as a specific case of ELM.

The original solutions (21) of ELM [12], [13], [26], TER-ELM [22], and the weighted regularized ELM [21] are not able to apply kernels in their implementations. With the new suggested approach, kernels can be used in ELM [cf. (49)].

E. Milder Optimization Constraints 

In LS-SVM, as the feature mapping $\phi(\mathbf{x})$ is usually unknown, it is reasonable to think that the separating hyperplane in LS-SVM may not necessarily pass through the origin in the LS-SVM feature space, and thus, a bias term $b$ is required in its optimization constraints: $t_i(\mathbf{w} \cdot \phi(\mathbf{x}_i) + b) = 1 - \xi_i$. The corresponding KKT condition (necessary condition) [cf. (10b)] for the conventional LS-SVM is $\sum_{i=1}^{N}\alpha_i t_i = 0$. Poggio et al. [46] prove in theory that the bias term $b$ is not required for positive definite kernels and that it is not incorrect to have the bias term $b$ in the SVM model. Different from the analysis of Poggio et al. [46], Huang et al. [29] show that, from the practical and universal approximation point of view, the bias term $b$ should not be given in the ELM learning.

According to ELM theories [12]–[16], almost all nonlinear piecewise continuous functions as feature mappings can make ELM satisfy universal approximation capability, and the separating hyperplane of ELM basically tends to pass through the origin in the ELM feature space. There is no bias term $b$ in the optimization constraint of ELM, $\mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} = \mathbf{t}_i^T - \boldsymbol{\xi}_i^T$, and thus, different from LS-SVM, ELM does not need to satisfy the condition $\sum_{i=1}^{N}\alpha_i t_i = 0$. Although LS-SVM and ELM have the same primal optimization formula, ELM has milder optimization constraints than LS-SVM, and thus, compared to ELM, LS-SVM obtains a suboptimal solution.

The differences and relationships among ELM, LS-SVM/PSVM, and SVM are summarized in Table I.

V. Performance Verification 

This section compares the performance of different algo¬ 
rithms (SVM, LS-SVM, and ELM) in real-world benchmark 
regression, binary, and multiclass classification data sets. In 
order to test the performance of the proposed ELM with various 
feature mappings in supersmall data sets, we have also tested 
ELM on the XOR problem. 

A. Benchmark Data Sets 

In order to extensively verify the performance of different algorithms, a wide variety of data sets have been tested in our simulations, which are of small sizes, low dimensions, large sizes, and/or high dimensions. These data sets include 12 binary classification cases, 12 multiclass classification cases, and 12 regression cases. Most of the data sets are taken from UCI Machine Learning Repository [47] and Statlib [48].

TABLE II
Specification of Binary Classification Problems

| Datasets | # train | # test | # features | Random Perm |
|---|---|---|---|---|
| Diabetes | 512 | 256 | 8 | Yes |
| Australian Credit | 460 | 230 | 6 | Yes |
| Liver | 230 | 115 | 6 | Yes |
| Banana | 400 | 4900 | 2 | No |
| Colon | 30 | 32 | 2000 | No |
| Colon (Gene Sel) | 30 | 32 | 60 | No |
| Leukemia | 38 | 34 | 7129 | No |
| Leukemia (Gene Sel) | 38 | 34 | 60 | No |
| Brightdata | 1000 | 1462 | 14 | Yes |
| Dimdata | 1000 | 3192 | 14 | Yes |
| Mushroom | 1500 | 6624 | 22 | Yes |
| Adult | 4781 | 27780 | 123 | No |

1) Binary Class Data Sets: The 12 binary class data sets (cf. Table II) can be classified into four groups of data:

1) data sets with relatively small size and low dimensions, e.g., Pima Indians diabetes, Statlog Australian credit, Bupa Liver disorders [47], and Banana [49];
2) data sets with relatively small size and high dimensions, e.g., leukemia data set [50] and colon microarray data set [51];
3) data sets with relatively large size and low dimensions, e.g., Star/Galaxy-Bright data set [52], Galaxy Dim data set [52], and mushroom data set [47];
4) data sets with large size and high dimensions, e.g., adult data set [47].

The leukemia data set was originally taken from a collection of leukemia patient samples [53]. The data set consists of 72 samples: 25 samples of AML and 47 samples of ALL. Each sample of the leukemia data set is measured over 7129 genes (cf. Leukemia in Table II). The colon microarray data set consists of 22 normal and 40 tumor tissue samples. Each sample of the colon microarray data set contains 2000 genes (cf. Colon in Table II).

Performances of the different algorithms have also been tested on both the leukemia data set and the colon microarray data set after applying the minimum-redundancy-maximum-relevance feature selection method [54] (cf. Leukemia (Gene Sel) and Colon (Gene Sel) in Table II).

2) Multiclass Data Sets: The 12 multiclass data sets (cf. Table III) can be classified into four groups of data as well:

1) data sets with relatively small size and low dimensions, e.g., Iris, Glass Identification, and Wine [47];
2) data sets with relatively medium size and medium dimensions, e.g., Vowel Recognition, Statlog Vehicle Silhouettes, and Statlog Image Segmentation [47];
3) data sets with relatively large size and medium dimensions, e.g., letter and shuttle [47];
4) data sets with large size and/or large dimensions, e.g., DNA, Satimage [47], and USPS [50].

3) Regression Data Sets: The 12 regression data sets (cf. Table IV) can be classified into three groups of data:

1) data sets with relatively small size and low dimensions, e.g., Basketball, Strike [48], Cloud, and Autoprice [47];
2) data sets with relatively small size and medium dimensions, e.g., Pyrim, Housing [47], Bodyfat, and Cleveland [48];
3) data sets with relatively large size and low dimensions, e.g., Balloon, Quake, Space-ga [48], and Abalone [47].

Column "random perm" in Tables II–IV shows whether the training and testing data of the corresponding data sets are reshuffled at each trial of simulation. If the training and testing data of the data sets remain fixed for all trials of simulations, it is marked "No." Otherwise, it is marked "Yes."

TABLE I
Feature Comparisons Among ELM, LS-SVM, and SVM

| | ELM | LS-SVM and PSVM | SVM |
|---|---|---|---|
| Feature mapping | i) $\mathbf{h}(\mathbf{x})$ is usually known to users (a wide type of nonlinear piecewise continuous functions (with randomly generated parameters) can be used [14]–[16]: e.g., additive, RBF, trigonometric, threshold, fully complex, high-order, ridge polynomial, etc.). ii) If $\mathbf{h}(\mathbf{x})$ is unknown, kernels can be used in ELM. | $\phi(\mathbf{x})$ is usually unknown to users and kernels are often used. | $\phi(\mathbf{x})$ is usually unknown to users and kernels are often used. |
| Kernel matrix | $\boldsymbol{\Omega}_{\text{ELM}} = [\Omega_{ij}]$ where $\Omega_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ | $\boldsymbol{\Omega}_{\text{LS-SVM}} = \boldsymbol{\Omega}_{\text{PSVM}} = [\Omega_{ij}]$ where $\Omega_{ij} = t_i t_j K(\mathbf{x}_i, \mathbf{x}_j)$ | $\boldsymbol{\Omega}_{\text{SVM}} = [\Omega_{ij}]$ where $\Omega_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ |
| Universal approximation capability | Ensured for almost all types of nonlinear piecewise random hidden nodes and often used kernels | Not ensured. It depends on the kernel to be used | Not ensured. It depends on the kernel to be used |
| Support vectors based | No ($\alpha_i$ are proportional to the training errors) | No ($\alpha_i$ are proportional to the training errors) | Yes (support vectors $\mathbf{x}_i$ corresponding to $\alpha_i \neq 0$) |
| Conditions on $\alpha_i$ | No conditions on $\alpha_i$ | For LS-SVM: $\sum_{i=1}^{N} t_i\alpha_i = 0$; for PSVM: $\sum_{i=1}^{N} t_i\alpha_i = b$ | $\sum_{i=1}^{N} t_i\alpha_i = 0$ and $0 \leq \alpha_i \leq C$ |
| Multi classes | Single ELM for $m$ classes | $m$ binary LS-SVM for $m$ classes, depending on coding/decoding scheme | $m$ or $m(m-1)/2$ binary SVM for $m$ classes |
| Regression | ELM unified for both regression and binary/multi-class classification | Different variants required for regression and classification | Different variants required for regression and classification |
| Output function | Kernel based: $\mathbf{f}(\mathbf{x}) = [K(\mathbf{x}, \mathbf{x}_1), \ldots, K(\mathbf{x}, \mathbf{x}_N)]\left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}_{\text{ELM}}\right)^{-1}\mathbf{T}$. Non-kernel based: $\mathbf{f}(\mathbf{x}) = \mathbf{h}(\mathbf{x})\mathbf{H}^T\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right)^{-1}\mathbf{T}$ or $\mathbf{f}(\mathbf{x}) = \mathbf{h}(\mathbf{x})\left(\frac{\mathbf{I}}{C} + \mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{T}$ | Kernel based LS-SVM and PSVM: $f(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{N}\alpha_i t_i K(\mathbf{x}, \mathbf{x}_i) + b\right)$. Non-kernel based PSVM: $f(\mathbf{x}) = \operatorname{sign}\left(\phi(\mathbf{x})\mathbf{Z}^T\left(\frac{\mathbf{I}}{C} + \mathbf{Z}\mathbf{Z}^T + \mathbf{T}\mathbf{T}^T\right)^{-1}\vec{\mathbf{1}} + b\right)$ where $b = \mathbf{T}^T\left(\frac{\mathbf{I}}{C} + \mathbf{Z}\mathbf{Z}^T + \mathbf{T}\mathbf{T}^T\right)^{-1}\vec{\mathbf{1}}$ | $f(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{N}\alpha_i t_i K(\mathbf{x}, \mathbf{x}_i) + b\right)$ |

TABLE III
Specification of Multiclass Classification Problems

| Datasets | # train | # test | # features | # classes | Random Perm |
|---|---|---|---|---|---|
| Iris | 100 | 50 | 4 | 3 | Yes |
| Glass | 142 | 72 | 9 | 6 | Yes |
| Wine | 118 | 60 | 13 | 3 | Yes |
| Ecoli | 224 | 112 | 7 | 8 | Yes |
| Vowel | 528 | 462 | 10 | 11 | No |
| Vehicle | 564 | 282 | 18 | 4 | Yes |
| Segment | 1540 | 770 | 19 | 7 | Yes |
| Satimage | 4435 | 2000 | 36 | 6 | No |
| DNA | 2000 | 1186 | 180 | 3 | No |
| Letter | 13333 | 6667 | 16 | 26 | Yes |
| Shuttle | 43500 | 14500 | 9 | 7 | No |
| USPS | 7291 | 2007 | 256 | 10 | No |


TABLE IV
Specification of Regression Problems

| Datasets | # train | # test | # features | Random Perm |
|---|---|---|---|---|
| Baskball | 64 | 32 | 4 | Yes |
| Cloud | 72 | 36 | 9 | Yes |
| Autoprice | 106 | 53 | 9 | Yes |
| Strike | 416 | 209 | 6 | Yes |
| Pyrim | 49 | 25 | 27 | Yes |
| Bodyfat | 168 | 84 | 14 | Yes |
| Cleveland | 202 | 101 | 13 | Yes |
| Housing | 337 | 169 | 13 | Yes |
| Balloon | 1334 | 667 | 2 | Yes |
| Quake | 1452 | 726 | 3 | Yes |
| Space-ga | 2071 | 1036 | 6 | Yes |
| Abalone | 2784 | 1393 | 8 | Yes |


B. Simulation Environment Settings 

The simulations of different algorithms on all the data sets except for Adult, Letter, Shuttle, and USPS data sets are carried out in MATLAB 7.0.1 environment running in Core 2 Quad, 2.66-GHz CPU with 2-GB RAM. The codes used for SVM and LS-SVM are downloaded from [55] and [56], respectively.

Simulations on large data sets (e.g., Adult, Letter, Shuttle, and USPS data sets) are carried out in a high-performance computer with 2.52-GHz CPU and 48-GB RAM. The symbol "*" marked in Tables VI and VII indicates that the corresponding data sets are tested in such a high-performance computer.

Fig. 2. Performances of LS-SVM and ELM with Gaussian kernel are sensitive to the user-specified parameters $(C, \gamma)$: An example on Satimage data set. (a) LS-SVM with Gaussian kernel. (b) ELM with Gaussian kernel.

C. User-Specified Parameters 

The popular Gaussian kernel function $K(\mathbf{u}, \mathbf{v}) = \exp(-\gamma\|\mathbf{u} - \mathbf{v}\|^2)$ is used in SVM, LS-SVM, and ELM. ELM performance is also tested in the cases of Sigmoid type of additive hidden node and multiquadric RBF hidden node.

In order to achieve good generalization performance, the cost parameter $C$ and kernel parameter $\gamma$ of SVM, LS-SVM, and ELM need to be chosen appropriately. We have tried a wide range of $C$ and $\gamma$. For each data set, we have used 50 different values of $C$ and 50 different values of $\gamma$, resulting in a total of 2500 pairs of $(C, \gamma)$. The 50 different values of $C$ and $\gamma$ are $\{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$.
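The search grid described above can be enumerated directly. In the sketch below, the selection helper `select_best` and the stand-in score function `toy_score` are hypothetical; in practice the score would be a measured generalization performance for each $(C, \gamma)$ pair.

```python
import itertools

# The 50 candidate values 2^-24, 2^-23, ..., 2^24, 2^25 for both C and gamma.
values = [2.0 ** e for e in range(-24, 26)]
grid = list(itertools.product(values, values))  # 2500 (C, gamma) pairs

def select_best(grid, score):
    """Return the (C, gamma) pair maximizing a user-supplied score function."""
    return max(grid, key=lambda pair: score(*pair))

def toy_score(C, gamma):
    # Hypothetical stand-in for a validation accuracy, peaked at (2^3, 2^-2).
    return -abs(C - 2.0 ** 3) - abs(gamma - 2.0 ** -2)

best = select_best(grid, toy_score)
```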

It is known that the performance of SVM is sensitive to the combination of $(C, \gamma)$. Similar to SVM, the generalization performance of LS-SVM and ELM with Gaussian kernel depends closely on the combination of $(C, \gamma)$ as well (see Fig. 2 for the performance sensitivity of LS-SVM and ELM with Gaussian kernel on the user-specified parameters $(C, \gamma)$). The best generalization performance of SVM, LS-SVM, and ELM with Gaussian kernel is usually achieved in a very narrow range of such combinations. Thus, the best combination of $(C, \gamma)$ of SVM, LS-SVM, and ELM with Gaussian kernel needs to be chosen for each data set.




Fig. 3. Performance of ELM (with Sigmoid additive node and multiquadric RBF node) is not very sensitive to the user-specified parameters $(C, L)$, and good testing accuracies can be achieved as long as $L$ is large enough: An example on Satimage data set. (a) ELM with Sigmoid additive node. (b) ELM with multiquadric RBF node.

For ELM with Sigmoid additive hidden node and multiquadric RBF hidden node, $\mathbf{h}(\mathbf{x}) = [G(\mathbf{a}_1, b_1, \mathbf{x}), \ldots, G(\mathbf{a}_L, b_L, \mathbf{x})]$, where $G(\mathbf{a}, b, \mathbf{x}) = 1/(1 + \exp(-(\mathbf{a} \cdot \mathbf{x} + b)))$ for the Sigmoid additive hidden node or $G(\mathbf{a}, b, \mathbf{x}) = (\|\mathbf{x} - \mathbf{a}\|^2 + b^2)^{1/2}$ for the multiquadric RBF hidden node. All the hidden-node parameters $\{(\mathbf{a}_i, b_i)\}_{i=1}^{L}$ are randomly generated based on uniform distribution. The user-specified parameters are $(C, L)$, where $C$ is chosen from the range $\{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$. As seen from Fig. 3, ELM can achieve good generalization performance as long as the number of hidden nodes $L$ is large enough. In all our simulations on ELM with Sigmoid additive hidden node and multiquadric RBF hidden node, $L = 1000$. In other words, the performance of ELM with Sigmoid additive hidden node and multiquadric RBF hidden node is not sensitive to the number of hidden nodes $L$. Moreover, $L$ need not be specified by users; instead, users only need to specify one parameter: $C$.

Fifty trials have been conducted for each problem. Simulation results, including the average testing accuracy, the corresponding standard deviation (Dev), and the training times, are given in this section.

TABLE V
Parameters of the Conventional SVM, LS-SVM, and ELM

| Datasets | SVM (Gaussian kernel) C | SVM (Gaussian kernel) γ | LS-SVM (Gaussian kernel) C | LS-SVM (Gaussian kernel) γ | ELM Gaussian kernel C | ELM Gaussian kernel γ | ELM Sigmoid additive node C | ELM Sigmoid additive node L | ELM Multiquadric RBF node C | ELM Multiquadric RBF node L |
|---|---|---|---|---|---|---|---|---|---|---|
| Binary class datasets | | | | | | | | | | |
| Diabetes | 2^10 | 2^4 | 2^10 | 2^10 | 2^10 | 2^5 | 2^-2 | 1000 | 2^-2 | 1000 |
| Australian Credit | 2^14 | 2^4 | 2^7 | 2^10 | 2^10 | 2^9 | 2^-1 | 1000 | 2^-1 | 1000 |
| Liver | 2^18 | 2^4 | 2^5 | 2^7 | 2^8 | 2^5 | 2^1 | 1000 | 2^2 | 1000 |
| Banana | 2^25 | 2^2 | 2^14 | 2^2 | 2^0 | 2^4 | 2^22 | 1000 | 2^2 | 1000 |
| Colon | 2^12 | 2^6 | 2^4 | 2^16 | 2^7 | 2^20 | 2^1 | 1000 | 2^-7 | 1000 |
| Colon (Gene Sel) | 2^2 | 2^4 | 2^0 | 2^12 | 2^7 | 2^20 | 2^-14 | 1000 | 2^-13 | 1000 |
| Leukemia | 2^12 | 2^10 | 2^10 | 2^6 | 2^15 | 2^20 | 2^17 | 1000 | 2^13 | 1000 |
| Leukemia (Gene Sel) | 2^8 | 2 s | 2^10 | 2^10 | 2^15 | 2^0 | 2^6 | 1000 | 2^-7 | 1000 |
| Brightdata | 2^12 | 2^4 | 2^2 | 2^3 | 2^4 | 2^-4 | 2^5 | 1000 | 2^5 | 1000 |
| Dimdata | 2^14 | 2^4 | 2^3 | 2^3 | 2^5 | 2 s | 2^1 | 1000 | 2^1 | 1000 |
| Mushroom | 2^20 | 2^0 | 2^3 | 2^3 | 2^13 | 2^4 | 2^19 | 1000 | 2^18 | 1000 |
| Adult | 2^2 | 2^2 | 2^0 | 2^7 | 2^25 | 2 is | 2~ e | 1000 | 2^-6 | 1000 |
| Multi-class datasets | | | | | | | | | | |
| Iris | 1 | 2^-2 | 2^14 | 2^6 | 2^0 | 2 U | 2^5 | 1000 | 2^-3 | 1000 |
| Glass | 1 | 2^-2 | 2^2 | 2^2 | 2^5 | 2^0 | 2^3 | 1000 | 2^2 | 1000 |
| Wine | 1 | 1 | 2^16 | 2^14 | 2 s | 2^3 | 2^0 | 1000 | 2^-1 | 1000 |
| Ecoli | 2^16 | 2^5 | 2 s | 2^5 | 2^22 | 2^12 | 2^2 | 1000 | 2^0 | 1000 |
| Vowel | 1 | 1 | 2 s | 2^2 | 2 s | 2^-1 | 2^1 | 1000 | 2^0 | 1000 |
| Vehicle | 2^14 | 2^2 | 2 33 | 2^20 | 2^6 | 2^3 | 2^7 | 1000 | 2^10 | 1000 |
| Segment | 2^25 | 2^5 | 2^6 | 2^2 | 2^13 | 2^-5 | 2^10 | 1000 | 2^17 | 1000 |
| Satimage | 1 | 1 | 2^8 | 2^2 | 2^4 | 2^-2 | 2^7 | 1000 | 2^12 | 1000 |
| DNA | 2^18 | 2^12 | 2^12 | 2^8 | 2^6 | 2^6 | 2^-7 | 1000 | 2^1 | 1000 |
| Letter | 2^10 | 2- J | 2^8 | 2^0 | 2^3 | 2^-2 | 2^10 | 1000 | 2^24 | 1000 |
| Shuttle | 2^10 | 2^-2 | 2^18 | 2^-4 | 2 30 | 2^-10 | 2^25 | 1000 | 2^25 | 1000 |
| USPS | 2^8 | 2^0 | 2^5 | 2 s | 2^4 | 2 s | 2^10 | 1000 | 2^20 | 1000 |
| Regression datasets | | | | | | | | | | |
| Baskball | 2" | 2^0 | 2^0 | 2^3 | 2" | 2" | 2" | 1000 | 2^ | 1000 |
| Cloud | 2^2 | 2^0 | 2^-18 | 2^-17 | 2^15 | 2^8 | 2 -s | 1000 | 2^-14 | 1000 |
| Autoprice | 2 la | 2^5 | 2^5 | 2^2 | 2^6 | 2^4 | 2^-4 | 1000 | 2^6 | 1000 |
| Strike | 2^0 | 2^-4 | 2^0 | 2^-2 | 2^-4 | 2^5 | 2^-5 | 1000 | 2^8 | 1000 |
| Pyrim | 2 1 < ) | 2^8 | 2^5 | 2^7 | 2^2 | 2® | 2^-3 | 1000 | 2^3 | 1000 |
| Bodyfat | 2^25 | 2^7 | 2^25 | 2^20 | 2^12 | 2 ie | 2^0 | 1000 | 2^6 | 1000 |
| Cleveland | 2^2 | 2^2 | 2 s | 2^14 | 2^13 | 2^15 | 2^-3 | 1000 | 2^1 | 1000 |
| Housing | 2^4 | 2^2 | 2^4 | 2^4 | 2^2 | 2^8 | 2^5 | 1000 | 2^7 | 1000 |
| Balloon | 2^4 | 2^-2 | 2^10 | 2^0 | 2“° | 2^10 | 2^20 | 1000 | 2 is | 1000 |
| Quake | 2^5 | 2^5 | 2^5 | 2^14 | 2^5 | 2^14 | 2^0 | 1000 | 2^10 | 1000 |
| Space-ga | 2^8 | 2^-1 | 2 s | 2^2 | 2^2 | 2^20 | 2^4 | 1000 | 2 43 | 1000 |
| Abalone | 2^1 | 2^-1 | 2^4 | 2^4 | 2^0 | 2^0 | 2^0 | 1000 | 2^0 | 1000 |


The user-specified parameters chosen in our simulations are 
given in Table V. 

D. Performance Comparison on XOR Problem 

The performance of SVM, LS-SVM, and ELM has been tested in the XOR problem, which has two training samples in each class. The aim of this simulation is to verify whether ELM can handle some rare cases such as the cases with extremely few training samples. Fig. 4 shows the boundaries of different classifiers in the XOR problem. It can be seen that, similar to SVM and LS-SVM, ELM is able to solve the XOR problem well. User-specified parameters used in this XOR problem are chosen as follows: $(C, \gamma)$ for SVM is $(2^{10}, 2^{0})$, $(C, \gamma)$ for LS-SVM is $(2^{4}, 2^{14})$, $(C, \gamma)$ for ELM with Gaussian kernel is $(2^{5}, 2^{15})$, and $(C, L)$ for ELM with Sigmoid additive hidden node is $(2^{0}, 3000)$.
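A small sketch of the XOR experiment with Sigmoid additive hidden nodes is given below, using $(C, L) = (2^0, 3000)$ as stated above. The input coordinates, the +1/-1 label coding, the uniform sampling range for the hidden-node parameters, and the random seed are assumptions of this example, so it illustrates the procedure rather than reproducing the reported simulation.

```python
import numpy as np

# XOR with two training samples per class (coordinates and labels assumed).
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
t = np.array([[-1.0], [-1.0], [1.0], [1.0]])

def sigmoid_features(X, A, b):
    # Sigmoid additive hidden nodes: G(a, b, x) = 1 / (1 + exp(-(a . x + b)))
    return 1.0 / (1.0 + np.exp(-(X @ A.T + b)))

def train_elm_sigmoid(X, T, C, L, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(L, X.shape[1]))  # random input weights
    b = rng.uniform(-1.0, 1.0, size=L)                # random biases
    H = sigmoid_features(X, A, b)
    N = X.shape[0]
    # N is tiny here, so the N x N form of the solution is used.
    beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return A, b, beta

def decide(X_query, A, b, beta):
    # Binary decision function: sign of the single ELM output.
    return np.sign(sigmoid_features(X_query, A, b) @ beta)[:, 0]

A, b, beta = train_elm_sigmoid(X, t, C=2.0 ** 0, L=3000)
train_pred = decide(X, A, b, beta)
```

Evaluating `decide` on a dense grid of points in the unit square would give a picture of the separating boundary for this particular random draw.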

E. Performance Comparison on Real-World 
Benchmark Data sets 

Fig. 4. Separating boundaries of different classifiers in XOR problem. (a) SVM. (b) LS-SVM. (c) ELM (Gaussian kernel). (d) ELM (Sigmoid additive node).

Tables VI–VIII show the performance comparison of SVM, LS-SVM, and ELM with Gaussian kernel, random Sigmoid hidden nodes, and multiquadric RBF nodes. It can be seen that ELM can always achieve performance comparable to SVM and LS-SVM with much faster learning speed. As seen from Tables VI–VIII, different output functions of ELM can be used in different data sets in order to have efficient implementation

TABLE VI
Performance Comparison of SVM, LS-SVM, and ELM: Binary Class Data Sets

Rate and Dev are the testing rate and its standard deviation (%); Time is the training time (s). The ELM output function listed for the rows Diabetes through Leukemia (Gene Sel) is $f(\mathbf{x}) = \operatorname{sign}\left(\mathbf{h}(\mathbf{x})\mathbf{H}^T\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right)^{-1}\mathbf{T}\right)$, and for the rows Brightdata through Adult it is $f(\mathbf{x}) = \operatorname{sign}\left(\mathbf{h}(\mathbf{x})\left(\frac{\mathbf{I}}{C} + \mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{T}\right)$.

| Datasets | SVM Rate | SVM Dev | SVM Time | LS-SVM Rate | LS-SVM Dev | LS-SVM Time | ELM Gaussian kernel Rate | ELM Gaussian kernel Dev | ELM Gaussian kernel Time | ELM Sigmoid additive node Rate | ELM Sigmoid additive node Dev | ELM Sigmoid additive node Time | ELM Multiquadric RBF node Rate | ELM Multiquadric RBF node Dev | ELM Multiquadric RBF node Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Diabetes | 76.97 | 2.70 | 0.6759 | 77.18 | 2.07 | 0.1406 | 77.52 | 2.46 | 0.0528 | 77.95 | 2.18 | 0.2075 | 78.09 | 2.17 | 0.2306 |
| Australian Credit | 85.79 | 2.03 | 0.7042 | 85.91 | 1.85 | 0.1250 | 86.29 | 1.43 | 0.0403 | 86.18 | 1.80 | 0.1709 | 86.70 | 1.90 | 0.1691 |
| Liver | 72.65 | 3.61 | 0.5616 | 71.98 | 3.37 | 0.0625 | 72.14 | 3.74 | 0.0066 | 73.01 | 3.78 | 0.0528 | 71.57 | 0.04 | 0.0531 |
| Banana | 89.84 | 0 | 0.8120 | 89.63 | 0 | 0.0620 | 89.83 | 0 | 0.0469 | 89.61 | 0.05 | 0.1350 | 89.30 | 0.01 | 0.1416 |
| Colon | 84.38 | 0 | 0.1617 | 81.25 | 0 | 0.4531 | 84.38 | 0 | 0.0031 | 81.63 | 3.32 | 0.1103 | 82.13 | 1.55 | 0.1472 |
| Colon (Gene Sel) | 84.38 | 0 | 0.0462 | 87.50 | 0 | 0.0469 | 90.63 | 0 | 0.0010 | 89.62 | 2.85 | 0.0075 | 87.50 | 0 | 0.0072 |
| Leukemia | 82.34 | 0 | 1.007 | 85.29 | 0 | 1.703 | 82.35 | 0 | 0.0309 | 80.08 | 3.85 | 0.4288 | 83.47 | 3.41 | 0.5550 |
| Leukemia (Gene Sel) | 100 | 0 | 0.0494 | 100 | 0 | 0.0625 | 100 | 0 | 0.0003 | 100 | 0 | 0.0075 | 100 | 0 | 0.0106 |
| Brightdata | 99.46 | 0.23 | 1.289 | 99.24 | 0.17 | 0.5413 | 98.91 | 0.25 | 0.2984 | 99.31 | 0.2 | 0.7573 | 99.31 | 0.17 | 0.7401 |
| Dimdata | 95.85 | 0.27 | 0.8908 | 95.19 | 0.35 | 0.5781 | 95.89 | 0.34 | 0.2734 | 95.75 | 0.29 | 0.749 | 95.67 | 0.26 | 0.75 |
| Mushroom | 89.88 | 0.43 | 46.56 | 88.87 | 0.41 | 1.531 | 88.84 | 0.36 | 0.8133 | 88.91 | 0.36 | 1.038 | 88.88 | 0.33 | 1.047 |
| * Adult | 84.51 | 0 | 29.382 | 84.79 | 0 | 5.5703 | 84.58 | 0 | 3.3116 | 84.59 | 0.05 | 0.4246 | 84.51 | 0.02 | 0.5682 |


TABLE VII

Performance Comparison of SVM, LS-SVM, and ELM: Multiclass Data Sets

Rate and Dev are the testing rate and its deviation; Time is the training time.

Datasets            | SVM                        | LSSVM                      | ELM Gaussian Kernel        | ELM Sigmoid Additive Node  | ELM Multiquadrics RBF Node
                    | Rate (%) Dev (%) Time (s)  | Rate (%) Dev (%) Time (s)  | Rate (%) Dev (%) Time (s)  | Rate (%) Dev (%) Time (s)  | Rate (%) Dev (%) Time (s)

Output function shown under the Sigmoid additive node and multiquadrics RBF node columns for the rows below:
$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}$

Iris                | 95.12    2.45    0.075     | 96.28    2.36    0.0021    | 96.04    2.37    0.0022    | 97.6     2.29    0.0156    | 97.33    2.12    0.0161
Glass               | 67.83    4.67    0.2871    | 67.22    5.04    0.0097    | 68.41    4.81    0.0026    | 67.12    4.99    0.0262    | 66.89    4.97    0.0264
Wine                | 98.37    1.41    0.075     | 97.63    1.82    0.0043    | 98.48    1.7     0.0019    | 98.47    1.81    0.0222    | 98.57    1.26    0.0206
Ecoli               | 86.56    3.65    0.2469    | 85.93    2.82    0.0244    | 87.48    2.8     0.008     | 87.23    2.88    0.053     | 87.79    2.74    0.054
Vowel               | 56.28    0       2.172     | 52.81    0       0.3290    | 58.66    0       0.0688    | 53.73    0.91    0.2187    | 54.75    0.89    0.2231
Vehicle             | 84.37    1.71    1.5144    | 83.19    1.93    0.2029    | 83.16    1.89    0.0831    | 83.48    1.78    0.2557    | 83.95    1.85    0.2547

Output function shown under the Sigmoid additive node and multiquadrics RBF node columns for the rows below:
$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\left(\frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}\mathbf{T}$

Segment             | 96.53    0.64    14.30     | 96.12    0.75    4.302     | 96.53    0.54    0.9272    | 96.07    0.69    1.889     | 95.54    0.41    1.070
Satimage            | 89.75    0       698.4     | 90.05    0       82.67     | 92.35    0       15.70     | 89.8     0.32    2.808     | 89.06    0.38    2.803
DNA                 | 92.86    0       7732      | 93.68    0       6.359     | 96.29    0       2.156     | 93.81    0.24    1.586     | 94.81    0.32    1.597
* Letter            | 92.87    0.26    302.9     | 93.12    0.27    335.838   | 97.41    0.13    41.89     | 93.51    0.15    0.7881    | 93.96    0.15    1.4339
* shuttle           | 99.74    0       2864.0    | 99.82    0       24767.0   | 99.91    0       4029.0    | 99.64    0.01    3.3379    | 99.65    0.02    5.5455
* USPS              | 96.14    0       12460     | 96.76    0       59.1357   | 98.9     0       9.2784    | 96.28    0.28    0.6877    | 97.25    0.24    0.9008


for data sets of different sizes, although any output function can
be used in all types of data sets.

Take the Shuttle (large number of training samples) and USPS
(medium-sized data set with high input dimensions) data sets
in Table VII as examples.

1) For the Shuttle data set, ELM with Gaussian kernel and
ELM with random multiquadric RBF nodes run 6 and 4466 times
faster than LS-SVM, respectively.

2) For the USPS data set, ELM with Gaussian kernel and
ELM with random multiquadric RBF nodes run 6 and 65 times
faster than LS-SVM, respectively, and run 1342 and
13 832 times faster than SVM, respectively.
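
The ratios quoted above follow directly from the training times in Table VII. As a quick check, the short Python sketch below recomputes them. The dictionary layout and the helper name `speedup` are illustrative choices and not part of the original experiments; the figures quoted in the text correspond to these ratios truncated to integers.

```python
# Training times in seconds, copied from the Shuttle and USPS rows of Table VII.
train_time = {
    "Shuttle": {"SVM": 2864.0, "LS-SVM": 24767.0,
                "ELM-Gaussian": 4029.0, "ELM-RBF": 5.5455},
    "USPS": {"SVM": 12460.0, "LS-SVM": 59.1357,
             "ELM-Gaussian": 9.2784, "ELM-RBF": 0.9008},
}


def speedup(dataset, baseline, method):
    """Return how many times faster `method` trains than `baseline`."""
    times = train_time[dataset]
    return times[baseline] / times[method]


# Comparisons discussed in the text: (data set, baseline method).
comparisons = [("Shuttle", "LS-SVM"), ("USPS", "LS-SVM"), ("USPS", "SVM")]
for dataset, baseline in comparisons:
    for method in ("ELM-Gaussian", "ELM-RBF"):
        ratio = speedup(dataset, baseline, method)
        print(f"{dataset}: {method} is {ratio:.1f}x faster than {baseline}")
```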

On the other hand, unlike LS-SVM, which is sensitive
to the combinations of parameters $(C, \gamma)$, ELM with random
multiquadric RBF nodes is not sensitive to the unique
user-specified parameter C [cf. Fig. 3(b)] and is easy to use in the
respective implementations.

Tables VI-VIII particularly highlight the performance comparison
between LS-SVM and ELM with Gaussian kernel;
between these two algorithms, the clearly
better test results are given in boldface. It can be seen that
ELM with Gaussian kernel achieves the same generalization
performance as LS-SVM in almost all the binary classification and
regression cases at much faster learning speeds; however,
ELM usually achieves much better generalization performance
than LS-SVM in multiclass classification cases (cf. Table VII).

Fig. 5 shows the boundaries of different classifiers in the Banana
case. It can be seen that ELM can classify different classes
well.

VI. Conclusion 

ELM is a learning mechanism for the generalized SLFNs,
where learning is made without iterative tuning. The essence of
ELM is that the hidden layer of the generalized SLFNs should
not be tuned. Unlike traditional learning theories,
ELM learning theory [14]-[16] shows that if SLFNs
$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta}$ with tunable piecewise continuous hidden-layer
feature mapping $\mathbf{h}(\mathbf{x})$ can approximate any target continuous
TABLE VIII

Performance Comparison of SVM, LS-SVM, and ELM: Regression Data Sets

[Column layout: Datasets; SVR; LSSVR; Extreme Learning Machine (Gaussian Kernel, Sigmoid Additive Node, Multiquadrics RBF Node); each with Testing RMSE, Testing Dev, and Training Time (s). The table body is missing from the source.]

Fig. 5. Separating boundaries of different classifiers in Banana case. (a) SVM.
(b) LS-SVM. (c) ELM (Gaussian kernel). (d) ELM (Sigmoid additive node).

functions, then tuning is not required in the hidden layer. All
the hidden-node parameters which are supposed to be tuned
by conventional learning algorithms can be randomly generated
according to any continuous sampling distribution [14]-[16].
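
To make this concrete, the following is a minimal NumPy sketch of ELM training with randomly generated Sigmoid additive hidden nodes. It assumes hidden-node parameters drawn uniformly from [-1, 1] and uses the two regularized output-weight solutions that appear in Tables VI and VII, choosing whichever leads to the smaller linear system. The function names, the sampling range, the rule for choosing between the two solutions, and the toy three-class data are assumptions made for illustration only; this is not the authors' implementation.

```python
import numpy as np


def hidden_layer(X, W, b):
    """Sigmoid additive hidden-layer output matrix H (N x L)."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))


def elm_train(X, T, L=1000, C=1.0, seed=0):
    """Train an ELM: random hidden nodes, closed-form output weights beta."""
    rng = np.random.default_rng(seed)
    # Hidden-node parameters are sampled once and never tuned
    # (the uniform range [-1, 1] is an assumption of this sketch).
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], L))
    b = rng.uniform(-1.0, 1.0, size=L)
    H = hidden_layer(X, W, b)
    N = X.shape[0]
    if N <= L:
        # beta = H^T (I/C + H H^T)^{-1} T, an N x N system
        beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    else:
        # beta = (I/C + H^T H)^{-1} H^T T, an L x L system
        beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    return W, b, beta


def elm_predict(X, W, b, beta):
    """ELM output f(x) = h(x) beta, one column per output node."""
    return hidden_layer(X, W, b) @ beta


# Toy three-class problem (synthetic data, for illustration only).
rng = np.random.default_rng(1)
centers = np.array([[-1.0, -1.0], [1.0, -1.0], [0.0, 1.0]])
labels = np.repeat(np.arange(3), 30)
X = centers[labels] + 0.2 * rng.standard_normal((90, 2))
# Multi-output targets: +1 for the true class, -1 elsewhere.
T = -np.ones((90, 3))
T[np.arange(90), labels] = 1.0

W, b, beta = elm_train(X, T, L=200, C=100.0)
predicted = elm_predict(X, W, b, beta).argmax(axis=1)
print("training accuracy:", (predicted == labels).mean())
```

In this sketch, the only quantities a user sets are C and L; the same code handles multiclass targets (as above) or a single regression output without modification.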

This paper has shown that both LS-SVM and PSVM can
be simplified by removing the bias term b and the resultant
learning algorithms are unified with ELM. Instead of different
variants being required for different types of applications, ELM can
be applied in regression and multiclass classification
applications directly. More importantly, according to ELM theory
[14]-[16], ELM can work with a widespread type of
feature mappings (including Sigmoid networks, RBF networks,
trigonometric networks, threshold networks, fuzzy inference
systems, fully complex neural networks, high-order networks,
ridge polynomial networks, etc.).

ELM requires less human intervention than SVM and
LS-SVM/PSVM. If the feature mappings $\mathbf{h}(\mathbf{x})$ are known to users,
only one parameter C needs to be specified by users in ELM.
The generalization performance of ELM is not sensitive to the
dimensionality L of the feature space (the number of hidden
nodes) as long as L is set large enough (e.g., L > 1000 for
all the real-world cases tested in our simulations). Unlike
SVM, LS-SVM, and PSVM, which usually require two
parameters $(C, \gamma)$ to be specified by users, the single-parameter
setting makes ELM easy and efficient to use.

If feature mappings are unknown to users, kernels can be
applied in ELM as well, similar to SVM, LS-SVM, and PSVM.
Unlike LS-SVM and PSVM, ELM does not have
constraints on the Lagrange multipliers $\alpha_i$'s. Since LS-SVM and
ELM have the same optimization objective functions and
LS-SVM has some optimization constraints on the Lagrange
multipliers $\alpha_i$'s, in this sense, LS-SVM tends to obtain a solution which
is suboptimal compared with that of ELM.
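
For the kernel case, a minimal NumPy sketch under stated assumptions is given below. It assumes the Gaussian kernel $K(\mathbf{u}, \mathbf{v}) = \exp(-\gamma\|\mathbf{u} - \mathbf{v}\|^2)$ and the kernel form of the ELM output, $f(\mathbf{x}) = [K(\mathbf{x}, \mathbf{x}_1), \ldots, K(\mathbf{x}, \mathbf{x}_N)]\left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}\right)^{-1}\mathbf{T}$ with $\Omega_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$. The coefficients come from a single unconstrained linear system: there is no bias term and no equality constraint on them. The function names and the values of C and $\gamma$ are illustrative, and the data are the four XOR points rather than any data set from the experiments.

```python
import numpy as np


def gaussian_kernel(A, B, gamma):
    """Pairwise Gaussian kernel matrix between the rows of A and B."""
    sq_dist = ((A ** 2).sum(axis=1)[:, None]
               + (B ** 2).sum(axis=1)[None, :]
               - 2.0 * A @ B.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def kernel_elm_fit(X, T, C, gamma):
    """Solve (I/C + Omega) alpha = T; alpha is unconstrained, with no bias."""
    omega = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(np.eye(X.shape[0]) / C + omega, T)


def kernel_elm_predict(X_new, X_train, alpha, gamma):
    """Evaluate f(x) = [K(x, x_1), ..., K(x, x_N)] alpha for each new point."""
    return gaussian_kernel(X_new, X_train, gamma) @ alpha


# XOR problem: opposite corners of the unit square share a label.
X_xor = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
T_xor = np.array([-1.0, -1.0, 1.0, 1.0])

alpha = kernel_elm_fit(X_xor, T_xor, C=2.0 ** 5, gamma=1.0)
outputs = kernel_elm_predict(X_xor, X_xor, alpha, gamma=1.0)
print("predicted labels:", np.sign(outputs))
```

For binary classification the decision is the sign of the output, as in the sketch; for multiclass problems T would have one column per class and the predicted class would be the index of the largest output.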

As verified by the simulation results, compared to SVM
and LS-SVM, ELM achieves similar or better generalization
performance for regression and binary class classification cases,
and much better generalization performance for multiclass
classification cases. ELM has better scalability and runs at much
faster learning speed (up to thousands of times) than traditional
SVM and LS-SVM.

This paper has also shown that, in theory, ELM can
approximate any target continuous function and classify any disjoint
regions.